Abstract

This tutorial provides an overview of and introduction to Rissanen's Minimum De- 
scription Length (MDL) Principle. The first chapter provides a conceptual, entirely 
non-technical introduction to the subject. It serves as a basis for the technical in- 
troduction given in the second chapter, in which all the ideas of the first chapter 
are made mathematically precise. This tutorial will appear as the first two chapters 
in the collection Advances in Minimum Description Length: Theory and Applications 
[Grünwald, Myung, and Pitt 2004], to be published by the MIT Press.

Chapter 1

Introducing the MDL Principle 



1.1 Introduction and Overview 

How does one decide among competing explanations of data given limited observations? 
This is the problem of model selection. It stands out as one of the most important 
problems of inductive and statistical inference. The Minimum Description Length 
(MDL) Principle is a relatively recent method for inductive inference that provides a 
generic solution to the model selection problem. MDL is based on the following insight: 
any regularity in the data can be used to compress the data, i.e. to describe it using 
fewer symbols than the number of symbols needed to describe the data literally. The 
more regularities there are, the more the data can be compressed. Equating 'learning' 
with 'finding regularity', we can therefore say that the more we are able to compress 
the data, the more we have learned about the data. Formalizing this idea leads to a 
general theory of inductive inference with several attractive properties: 

1. Occam's Razor. MDL chooses a model that trades off goodness-of-fit on the
observed data against the 'complexity' or 'richness' of the model. As such, MDL embodies
a form of Occam's Razor, a principle that is both intuitively appealing and
informally applied throughout all the sciences.

2. No overfitting, automatically. MDL procedures automatically and inherently
protect against overfitting and can be used to estimate both the parameters and the
structure (e.g., number of parameters) of a model. In contrast, to avoid
overfitting when estimating the structure of a model, traditional methods such as
maximum likelihood must be modified and extended with additional, typically ad
hoc principles.

3. Bayesian interpretation. MDL is closely related to Bayesian inference, but avoids
some of the interpretation difficulties of the Bayesian approach, especially in the
realistic case when it is known a priori to the modeler that none of the models
under consideration is true. In fact:

4. No need for 'underlying truth'. In contrast to other statistical methods, MDL
procedures have a clear interpretation independent of whether or not there exists
some underlying 'true' model.

5. Predictive interpretation. Because data compression is formally equivalent to a
form of probabilistic prediction, MDL methods can be interpreted as searching
for a model with good predictive performance on unseen data.

In this chapter, we introduce the MDL Principle in an entirely non-technical way,
concentrating on its most important applications, model selection and avoiding
overfitting. In Section 1.2 we discuss the relation between learning and data compression.
Section 1.3 introduces model selection and outlines a first, 'crude' version of MDL that
can be applied to model selection. Section 1.4 indicates how these 'crude' ideas need
to be refined to tackle small sample sizes and differences in model complexity between
models with the same number of parameters. Section 1.5 discusses the philosophy
underlying MDL, and considers its relation to Occam's Razor. Section 1.7 briefly discusses
the history of MDL. All this is summarized in Section 1.8.

1.2 The Fundamental Idea: Learning as Data Compression

We are interested in developing a method for learning the laws and regularities in data. 
The following example will illustrate what we mean by this and give a first idea of how 
it can be related to descriptions of data. 

Regularity . . . Consider the following three sequences. We assume that each se- 
quence is 10000 bits long, and we just list the beginning and the end of each sequence. 

00010001000100010001 ... 0001000100010001000100010001 (1.1) 

01110100110100100110 ... 1010111010111011000101100010 (1.2) 

00011000001010100000 ... 0010001000010000001000110000 (1.3) 

The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the
sequence looks regular; there seems to be a simple 'law' underlying it; it might make 
sense to conjecture that future data will also be subject to this law, and to predict 
that future data will behave according to this law. The second sequence has been 
generated by tosses of a fair coin. It is intuitively speaking as 'random as possible', and 
in this sense there is no regularity underlying it. Indeed, we cannot seem to find such a 
regularity either when we look at the data. The third sequence contains approximately 
four times as many 0s as 1s. It looks less regular, more random than the first; but it
looks less random than the second. There is still some discernible regularity in these 
data, but of a statistical rather than of a deterministic kind. Again, noticing that such 
a regularity is there and predicting that future data will behave according to the same 
regularity seems sensible. 



...and Compression We claimed that any regularity detected in the data can be used 
to compress the data, i.e. to describe it in a short manner. Descriptions are always 
relative to some description method which maps descriptions D' in a unique manner to 
data sets D. A particularly versatile description method is a general-purpose computer 
language like C or PASCAL. A description of D is then any computer program that 
prints D and then halts. Let us see whether our claim works for the three sequences 
above. Using a language similar to Pascal, we can write a program 

for i = 1 to 2500; print '0001'; next; halt 

which prints sequence (1.1) but is clearly a lot shorter. Thus, sequence (1.1) is indeed
highly compressible. On the other hand, we show in Section 2.2 that if one generates
a sequence like (1.2) by tosses of a fair coin, then with extremely high probability, the
shortest program that prints (1.2) and then halts will look something like this:

print '011101001101000010101010 ... 1010111010111011000101100010'; halt

This program's size is about equal to the length of the sequence. Clearly, it does nothing 
more than repeat the sequence. 

The third sequence lies in between the first two: generalizing n = 10000 to arbitrary
length n, we show in Section 2.2 that the first sequence can be compressed to O(log n)
bits; with overwhelming probability, the second sequence cannot be compressed at all;
and the third sequence can be compressed to some length αn, with 0 < α < 1.
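
As a rough illustration of these three regimes, the sketch below builds stand-ins for sequences (1.1)-(1.3) and compresses them with zlib. Several things here are assumptions made only for the illustration: a general-purpose compressor is a crude proxy for 'the shortest program', the random and biased sequences come from a seeded pseudo-random generator rather than a real coin, and a 1-in-5 frequency of 1s is used to mimic 'four times as many 0s as 1s'.

```python
import random
import zlib

N_BITS = 10_000


def pack(bits):
    """Pack a list of 0/1 ints into bytes, 8 bits per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


rng = random.Random(0)  # fixed seed so the illustration is reproducible
sequences = {
    "regular": [0, 0, 0, 1] * (N_BITS // 4),                       # like (1.1)
    "random": [rng.getrandbits(1) for _ in range(N_BITS)],         # like (1.2)
    "biased": [1 if rng.random() < 0.2 else 0 for _ in range(N_BITS)],  # like (1.3)
}

# Size of the zlib output, in bits, for each packed sequence.
compressed_bits = {
    name: 8 * len(zlib.compress(pack(bits), 9))
    for name, bits in sequences.items()
}

for name, size in compressed_bits.items():
    print(f"{name:8s} {N_BITS} bits -> {size} bits after zlib")
```

The regular sequence shrinks to a tiny fraction of its length, the coin-toss sequence does not shrink at all, and the biased one lands in between. For independent bits with a fraction 0.2 of 1s, the Shannon entropy is about 0.72 bits per symbol, which is the kind of α the text refers to; zlib is not tuned for this source and will typically stay above that value.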

Example 1.1 [compressing various regular sequences] The regularities underlying
sequences (1.1) and (1.3) were of a very particular kind. To illustrate that any type
of regularity in a sequence may be exploited to compress that sequence, we give a few
more examples:

The Number π Evidently, there exists a computer program for generating the first
n digits of π; such a program could be based, for example, on an infinite
series expansion of π. This computer program has constant size, except for the
specification of n which takes no more than O(log n) bits. Thus, when n is very
large, the size of the program generating the first n digits of π will be very small
compared to n: the π-digit sequence is deterministic, and therefore extremely
regular.

Physics Data Consider a two-column table where the first column contains numbers 
representing various heights from which an object was dropped. The second col- 
umn contains the corresponding times it took for the object to reach the ground. 
Assume both heights and times are recorded to some finite precision. In
Section 1.3 we illustrate that such a table can be substantially compressed by first
describing the coefficients of the second-degree polynomial H that expresses
Newton's law; then describing the heights; and then describing the deviation of the
time points from the numbers predicted by H.



Natural Language Most sequences of words are not valid sentences according to the
English language. This fact can be exploited to substantially compress English
text, as long as it is syntactically mostly correct: by first describing a grammar
for English, and then describing an English text D with the help of that grammar
[Grünwald 1996], D can be described using far fewer bits than are needed without
the assumption that word order is constrained.
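
To make the π item above concrete, here is a minimal sketch of a constant-size program that emits the first n decimal digits of π. It uses Gibbons' unbounded spigot algorithm, which is a choice made for this illustration (the text only says such a program 'could be based on an infinite series expansion'); the function name `pi_digits` is hypothetical.

```python
def pi_digits(n):
    """Return the first n decimal digits of pi (3, 1, 4, ...) as a string.

    Gibbons' unbounded spigot algorithm, using exact integer arithmetic.
    """
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    digits = []
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            # The next digit m is now certain: emit it and rescale the state.
            digits.append(str(m))
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            # Not enough information yet: absorb one more term of the series.
            q, r, t, k, m, x = (
                q * k,
                (2 * q + r) * x,
                t * x,
                k + 1,
                (q * (7 * k + 2) + r * x) // (t * x),
                x + 2,
            )
    return "".join(digits)


print(pi_digits(20))
```

The source text of `pi_digits` stays the same whether n is 20 or 20 million; only the argument n changes, and writing n down takes about log n bits. That is the sense in which the digit sequence is highly compressible.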

1.2.1 Kolmogorov Complexity and Ideal MDL 

To formalize our ideas, we need to decide on a description method, that is, a formal
language in which to express properties of the data. The most general choice is a
general-purpose computer language such as C or Pascal. This choice leads to the
definition of the Kolmogorov complexity [Li and Vitányi 1997] of a sequence as the
length of the shortest program that prints the sequence and then halts. The lower
the Kolmogorov complexity of a sequence, the more regular it is. This notion seems
to be highly dependent on the particular computer language used. However, it turns
out that for every two general-purpose programming languages A and B and every
data sequence D, the length of the shortest program for D written in language A and
the length of the shortest program for D written in language B differ by no more
than a constant c, which may depend on A and B but not on D. This so-called
invariance theorem says that, as long as the sequence D is long enough, it is not essential
which computer language one chooses, as long as it is general-purpose. Kolmogorov
complexity was introduced, and the invariance theorem was proved, independently by
Kolmogorov [1965], Chaitin [1969] and Solomonoff [1964]. Solomonoff's paper, called
A Theory of Inductive Inference, contained the idea that the ultimate model for a
sequence of data may be identified with the shortest program that prints the data.
Solomonoff's ideas were later extended by several authors, leading to an 'idealized'
version of MDL [Solomonoff 1978; Li and Vitányi 1997; Gács, Tromp, and Vitányi 2001].
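
Stated as a formula, the invariance theorem can be written as follows. The notation K_A(D) for the length of the shortest program for D in language A, and the subscripted constant, are assumptions introduced for this sketch and are not taken from the text.

```latex
\[
  \bigl| K_A(D) - K_B(D) \bigr| \;\le\; c_{A,B}
  \qquad \text{for all data sequences } D,
\]
% c_{A,B} depends on the two languages A and B, but not on D or its length.
```

Since the bound is a fixed number of bits, it becomes negligible relative to K_A(D) only when D is long and not too regular; for short sequences it can be as large as the description lengths being compared.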



This idealized MDL is very general in scope, but not practically applicable, for the 
following two reasons: 

1. Uncomputability. It can be shown that there exists no computer program that,
for every set of data D, when given D as input, returns the shortest program that
prints D [Li and Vitányi 1997].

2. Arbitrariness/dependence on syntax. In practice we are confronted with small
data samples for which the invariance theorem does not say much. Then the
hypothesis chosen by idealized MDL may depend on arbitrary details of the syntax
of the programming language under consideration.

1.2.2 Practical MDL 

Like most authors in the field, we concentrate here on non-idealized, practical versions 
of MDL that explicitly deal with the two problems mentioned above. The basic idea
is to scale down Solomonoff's approach so that it does become applicable. This is
achieved by using description methods that are less expressive than general-purpose 
computer languages. Such description methods C should be restrictive enough so that 
for any data sequence D, we can always compute the length of the shortest description 
of D that is attainable using method C; but they should be general enough to allow us 
to compress many of the intuitively 'regular' sequences. The price we pay is that, using 
the 'practical' MDL Principle, there will always be some regular sequences which we 
will not be able to compress. But we already know that there can be no method for 
inductive inference at all which will always give us all the regularity there is — simply 
because there can be no automated method which for any sequence D finds the shortest 
computer program that prints D and then halts. Moreover, it will often be possible to 
guide a suitable choice of C by a priori knowledge we have about our problem domain. 
For example, below we consider a description method C that is based on the class of 
all polynomials, such that with the help of C we can compress all data sets which can 
meaningfully be seen as points on some polynomial. 

1.3 MDL and Model Selection 

Let us recapitulate our main insights so far: 



MDL: The Basic Idea 

The goal of statistical inference may be cast as trying to find regularity in the data. 
'Regularity' may be identified with 'ability to compress'. MDL combines these two 
insights by viewing learning as data compression: it tells us that, for a given set of 
hypotheses ℋ and data set D, we should try to find the hypothesis or combination
of hypotheses in ℋ that compresses D most.



This idea can be applied to all sorts of inductive inference problems, but it turns out to 
be most fruitful in (and its development has mostly concentrated on) problems of model 
selection and, more generally, dealing with overfitting. Here is a standard example (we 
explain the difference between 'model' and 'hypothesis' after the example). 

Example 1.2 [Model Selection and Overfitting] Consider the points in Figure 1.1.
We would like to learn how the y-values depend on the x-values. To this end, we may
want to fit a polynomial to the points. Straightforward linear regression will give
us the leftmost polynomial, a straight line that seems overly simple: it does not
capture the regularities in the data well. Since for any set of n points with distinct
x-values there exists a polynomial of the (n - 1)-st degree that goes exactly through
all these points, simply looking for the polynomial with the least error will give us a
polynomial like the one in the second picture. This polynomial seems overly complex:
it reflects the random fluctuations in the data rather than the general pattern
underlying it. Instead of picking the overly simple or the overly complex polynomial,
it seems more reasonable to prefer a relatively simple polynomial with small but
nonzero error, as in the rightmost picture.

Figure 1.1: A simple, a complex and a trade-off (3rd degree) polynomial.

This intuition is confirmed by numerous experiments on real-world data from a broad
variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a
high-degree polynomial to a small sample (set of data points), then one obtains a very good
fit to the data. Yet if one tests the inferred polynomial on a second set of data coming
from the same source, it typically fits this test data very badly in the sense that there
is a large distance between the polynomial and the new data points. We say that the
polynomial overfits the data. Indeed, all model selection methods that are used in
practice either implicitly or explicitly choose a trade-off between goodness-of-fit and
complexity of the models involved. In practice, such trade-offs lead to much better
predictions of test data than one would get by adopting the 'simplest' (first-degree)
or most 'complex' ((n - 1)-st degree) polynomial. MDL provides one particular means of
achieving such a trade-off.
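
The overfitting effect described in this example can be reproduced with a small simulation. The sketch below is only an illustration under stated assumptions: the 'source' is a hypothetical cubic with Gaussian noise, the sample has 10 equally spaced x-values, and the names `true_curve` and `fit_and_score` are invented here.

```python
import numpy as np


def true_curve(x):
    # Hypothetical "source": a cubic, chosen only for this illustration.
    return 2.0 * x**3 - x


def fit_and_score(degree, seed, n_train=10, n_test=200, noise_sd=0.1):
    """Fit a polynomial of the given degree by least squares.

    Returns (training MSE, test MSE), where the test data are a second
    sample drawn from the same noisy source.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, n_train)
    y_train = true_curve(x_train) + rng.normal(0.0, noise_sd, n_train)
    x_test = rng.uniform(-1.0, 1.0, n_test)
    y_test = true_curve(x_test) + rng.normal(0.0, noise_sd, n_test)

    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


for degree in (1, 3, 9):  # too simple, trade-off, (n - 1)-st degree
    train_mse, test_mse = fit_and_score(degree, seed=0)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-9 polynomial passes through all 10 training points, so its training error is essentially zero, yet averaged over repeated samples its test error is far larger than that of the cubic. The straight line errs in the other direction: it fits both samples poorly.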

It will be useful to make a precise distinction between 'model' and 'hypothesis': 



Models vs. Hypotheses 

We use the phrase point hypothesis to refer to a single probability distribution or 
function. An example is the polynomial 5x^2 + 4x + 3. A point hypothesis is also
known as a 'simple hypothesis' in the statistical literature. 

We use the word model to refer to a family (set) of probability distributions or 
functions with the same functional form. An example is the set of all second- 
degree polynomials. A model is also known as a 'composite hypothesis' in the 
statistical literature. 

We use hypothesis as a generic term, referring to both point hypotheses and mod- 
els. 



In our terminology, the problem described in Example 1.2 is a 'hypothesis selection
problem' if we are interested in selecting both the degree of a polynomial and the
corresponding parameters; it is a 'model selection problem' if we are mainly interested in
selecting the degree.

To apply MDL to polynomial or other types of hypothesis and model selection, we 
have to make precise the somewhat vague insight 'learning may be viewed as data 
compression'. This can be done in various ways. In this section, we concentrate on the 
earliest and simplest implementation of the idea. This is the so-called two-part code 
version of MDL: 



Crude, Two-part Version of MDL Principle (Informally Stated)

Let ℋ^(1), ℋ^(2), ... be a list of candidate models (e.g., ℋ^(k) is the set of k-th degree
polynomials), each containing a set of point hypotheses (e.g., individual
polynomials). The best point hypothesis H ∈ ℋ^(1) ∪ ℋ^(2) ∪ ... to explain the data D is the
one which minimizes the sum L(H) + L(D|H), where

• L(H) is the length, in bits, of the description of the hypothesis; and

• L(D|H) is the length, in bits, of the description of the data when encoded
with the help of the hypothesis.

The best model to explain D is the smallest model containing the selected H.



Example 1.3 [Polynomials, cont.] In our previous example, the candidate hypotheses
were polynomials. We can describe a polynomial by describing its coefficients in a
certain precision (number of bits per parameter). Thus, the higher the degree of a
polynomial or the precision, the more bits we need to describe it and the more 'complex'
it becomes. A description of the data 'with the help of' a hypothesis means that the
better the hypothesis fits the data, the shorter the description will be. A hypothesis
that fits the data well gives us a lot of information about the data. Such information
can always be used to compress the data (Section 2.2). Intuitively, this is because we
only have to code the errors the hypothesis makes on the data rather than the full data.
In our polynomial example, the better a polynomial H fits D, the fewer bits we need
to encode the discrepancies between the actual y-values y_i and the predicted y-values
H(x_i). We can typically find a very complex point hypothesis (large L(H)) with a very
good fit (small L(D|H)). We can also typically find a very simple point hypothesis
(small L(H)) with a rather bad fit (large L(D|H)). The sum of the two description
lengths will be minimized at a hypothesis that is quite (but not too) 'simple', with a
good (but not perfect) fit.
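
The trade-off in this example can be sketched numerically. The code below is a deliberately crude illustration, not a prescribed coding scheme: it assumes a fixed cost of 16 bits per coefficient (without actually rounding the coefficients), a known Gaussian noise level, y-values recorded to a resolution of 0.01, and x-values that need not be encoded. The data-generating cubic and all names are hypothetical.

```python
import numpy as np

BITS_PER_COEFF = 16   # assumed fixed precision per coefficient
NOISE_SD = 0.1        # assumed known noise level
Y_RESOLUTION = 0.01   # assumed precision to which y-values are recorded


def hypothesis_bits(degree):
    """L(H): a degree-k polynomial has k + 1 coefficients."""
    return (degree + 1) * BITS_PER_COEFF


def data_bits(residuals):
    """L(D|H): -log2 of the Gaussian probability of the discretized errors."""
    log_density = (-0.5 * (residuals / NOISE_SD) ** 2
                   - np.log(NOISE_SD * np.sqrt(2.0 * np.pi)))
    log_prob = log_density + np.log(Y_RESOLUTION)  # density times cell width
    return float(-np.sum(log_prob) / np.log(2.0))


def two_part_lengths(x, y, max_degree):
    """Map each degree to (L(H), L(D|H)) for its least-squares polynomial."""
    lengths = {}
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        lengths[degree] = (hypothesis_bits(degree), data_bits(residuals))
    return lengths


rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x**3 - x + rng.normal(0.0, NOISE_SD, x.size)

lengths = two_part_lengths(x, y, max_degree=9)
best_degree = min(lengths, key=lambda d: sum(lengths[d]))
for degree, (l_h, l_d) in lengths.items():
    print(f"degree {degree}: L(H) = {l_h:4d}, L(D|H) = {l_d:7.1f}, sum = {l_h + l_d:7.1f}")
print("selected degree:", best_degree)
```

L(D|H) keeps shrinking as the degree grows, but beyond the third degree each extra coefficient saves far fewer bits than it costs, so the sum is minimized at degree 3. Changing `BITS_PER_COEFF` shifts the balance, which is exactly the kind of arbitrariness in L(H) that a more principled choice of code has to remove.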

1.4 Crude and Refined MDL

Crude MDL picks the H minimizing the sum L(H) + L(D|H). To make this procedure
well-defined, we need to agree on precise definitions for the codes (description methods)
giving rise to lengths L(D|H) and L(H). We now discuss these codes in more detail.
We will see that the definition of L(H) is problematic, indicating that we somehow
need to 'refine' our crude MDL Principle.

Definition of L(D|H). Consider a two-part code as described above, and assume for
the time being that all H under consideration define probability distributions. If H is
a polynomial, we can turn it into a distribution by making the additional assumption
that the y-values are given by Y = H(X) + Z, where Z is a normally distributed noise
term.

For each H we need to define a code with length L(· | H) such that L(D|H)
can be interpreted as 'the codelength of D when encoded with the help of H'. It
turns out that for probabilistic hypotheses, there is only one reasonable choice for
this code. It is the so-called Shannon-Fano code, satisfying, for all data sequences
D, L(D|H) = -log P(D|H), where P(D|H) is the probability mass or density of D
according to H. Such a code always exists; see Section 2.2.
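
The identity L(D|H) = -log P(D|H) is easy to evaluate for a simple probabilistic hypothesis. The sketch below assumes, purely for illustration, hypotheses of the form 'each bit is 1 with probability theta, independently', takes logarithms to base 2 so that lengths are in bits, and ignores the rounding needed to obtain whole-bit codewords. The function name is hypothetical.

```python
import math


def codelength_bits(data, theta):
    """L(D|H) = -log2 P(D|H) when H says each bit is 1 with probability theta."""
    ones = data.count("1")
    zeros = len(data) - ones
    return -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))


# 10000 bits with four times as many 0s as 1s; only the counts matter here.
data = "00001" * 2000

for theta in (0.5, 0.2, 0.05):
    print(f"theta = {theta}: L(D|H) = {codelength_bits(data, theta):.1f} bits")
```

Under theta = 0.5 every sequence of 10000 bits costs exactly 10000 bits, so nothing is gained. The hypothesis theta = 0.2, which matches the observed frequencies, gives the shortest codelength (about 7219 bits), while a poorly matched hypothesis such as theta = 0.05 gives a longer one.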

Definition of L(H): A Problem for Crude MDL. It is more problematic to
find a good code for hypotheses H. Some authors have simply used 'intuitively
reasonable' codes in the past, but this is not satisfactory: since the description length
L(H) of any fixed point hypothesis H can be very large under one code, but quite
short under another, our procedure is in danger of becoming arbitrary. Instead, we
need some additional principle for designing a code for ℋ. In the first publications on
MDL [Rissanen 1978; Rissanen 1983], it was advocated to choose some sort of
minimax code for ℋ, minimizing, in some precisely defined sense, the shortest worst-case
total description length L(H) + L(D|H), where the worst-case is over all possible data
sequences. Thus, the MDL Principle is employed at a 'meta-level' to choose a code
for H. However, this code requires a cumbersome discretization of the model space
ℋ, which is not always feasible in practice. Alternatively, Barron [1985] encoded H
by the shortest computer program that, when input D, computes P(D|H). While it
can be shown that this leads to similar codelengths, it is computationally problematic.
Later, Rissanen [1984] realized that these problems could be side-stepped by using a
one-part rather than a two-part code. This development culminated in 1996 in a
completely precise prescription of MDL for many, but certainly not all practical situations
[Rissanen 1996]. We call this modern version of MDL refined MDL:

Refined MDL. In refined MDL, we associate a code for encoding D not with a single
H ∈ ℋ, but with the full model ℋ. Thus, given model ℋ, we encode data not in
two parts but we design a single one-part code with lengths L(D|ℋ). This code is
designed such that whenever there is a member of (parameter in) ℋ that fits the data
well, in the sense that L(D|H) is small, then the codelength L(D|ℋ) will also be
small. Codes with this property are called universal codes in the information-theoretic
literature [Barron, Rissanen, and Yu 1998]. Among all such universal codes, we pick
the one that is minimax optimal in a sense made precise in Section 2.3. For example, the
set ℋ^(3) of third-degree polynomials is associated with a code with lengths L(· | ℋ^(3))
such that, the better the data D are fit by the best-fitting third-degree polynomial, the
shorter the codelength L(D|ℋ^(3)). L(D|ℋ) is called the stochastic complexity of the
data given the model.

Parametric Complexity. The second fundamental concept of refined MDL is the
parametric complexity of a parametric model ℋ, which we denote by COMP(ℋ). This
is a measure of the 'richness' of model ℋ, indicating its ability to fit random data.
This complexity is related to the degrees-of-freedom in ℋ, but also to the geometrical
structure of ℋ; see Example 1.4. To see how it relates to stochastic complexity, let, for
given data D, Ĥ denote the distribution in ℋ which maximizes the probability, and
hence minimizes the codelength L(D|Ĥ) of D. It turns out that

stochastic complexity of D given ℋ = L(D|Ĥ) + COMP(ℋ).
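
For a very small model, the two terms of this decomposition can be computed exactly. The sketch below makes two assumptions that go beyond this chapter: it takes ℋ to be the Bernoulli model (independent bits with an unknown probability of 1), and it takes COMP(ℋ) to be the logarithm of the sum, over all possible data sequences of length n, of the probability each sequence receives from its own best-fitting distribution (the normalized-maximum-likelihood construction). The function names are hypothetical.

```python
import math


def bernoulli_comp_bits(n):
    """Parametric complexity (in bits) of the Bernoulli model for n outcomes.

    Sums the maximized probability over all binary sequences of length n,
    grouping sequences by their number of 1s, k.
    """
    total = 0.0
    for k in range(n + 1):
        theta_hat = k / n
        # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
        max_prob = theta_hat**k * (1.0 - theta_hat) ** (n - k)
        total += math.comb(n, k) * max_prob
    return math.log2(total)


def stochastic_complexity_bits(data):
    """L(D | H-hat) + COMP for the Bernoulli model, under the assumptions above."""
    n, k = len(data), data.count("1")
    theta_hat = k / n
    fit_bits = 0.0
    if 0 < k < n:
        fit_bits = -(k * math.log2(theta_hat) + (n - k) * math.log2(1.0 - theta_hat))
    return fit_bits + bernoulli_comp_bits(n)


# Kept to n <= 1000 so the binomial coefficients still fit in a float.
for n in (10, 100, 1000):
    print(f"n = {n:4d}: COMP = {bernoulli_comp_bits(n):.2f} bits")
```

The complexity term depends only on the model and the sample size, not on the observed data, and for this one-parameter model it grows slowly with n. A richer model would fit random data better on average and would therefore receive a larger complexity term.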

Refined MDL model selection between two parametric models (such as the models of
first and second degree polynomials) now proceeds by selecting the model such that
the stochastic complexity of the given data D is smallest. Although we used a one-part
code to encode data, refined MDL model selection still involves a trade-off between two
terms: a goodness-of-fit term L(D|Ĥ) and a complexity term COMP(ℋ). However,
because we do not explicitly encode hypotheses H any more, there is no arbitrariness
any more. The resulting procedure can be interpreted in several different ways, some
of which provide us with rationales for MDL beyond the pure coding interpretation
(see Chapter 2):

1. Counting/differential geometric interpretation. The parametric complexity of
a model is the logarithm of the number of essentially different, distinguishable
point hypotheses within the model.

2. Two-part code interpretation. For large samples, the stochastic complexity can
be interpreted as a two-part codelength of the data after all, where hypotheses H
are encoded with a special code that works by first discretizing the model space
ℋ into a set of 'maximally distinguishable hypotheses', and then assigning equal
codelength to each of these.

3. Bayesian interpretation. In many cases, refined MDL model selection coincides
with Bayes factor model selection based on a non-informative prior such as
Jeffreys' prior [Bernardo and Smith 1994].

4. Prequential interpretation. Refined MDL model selection can be interpreted as
selecting the model with the best predictive performance when sequentially
predicting unseen test data, in the sense described in Section 2.6.4. This makes it
an instance of Dawid's [1984] prequential model validation and also relates it to
cross-validation methods.

Refined MDL allows us to compare models of different functional form. It even accounts 
for the phenomenon that different models with the same number of parameters may 
not be equally 'complex': 

Example 1.4 Consider two models from psychophysics describing the relationship
between physical dimensions (e.g., light intensity) and their psychological counterparts
(e.g., brightness) [Myung, Balasubramanian, and Pitt 2000]: y = ax^b + Z (Stevens'
model) and y = a ln(x + b) + Z (Fechner's model), where Z is a normally distributed
noise term. Both models have two free parameters; nevertheless, it turns out that in
a sense, Stevens' model is more flexible or complex than Fechner's. Roughly speaking,
this means there are a lot more data patterns that can be explained by Stevens' model
than can be explained by Fechner's model. Myung, Balasubramanian, and Pitt [2000]
generated many samples of size 4 from Fechner's model, using some fixed parameter
values. They then fitted both models to each sample. In 67% of the trials, Stevens'
model fitted the data better than Fechner's, even though the latter generated the data.
Indeed, in refined MDL, the 'complexity' associated with Stevens' model is much larger
than the complexity associated with Fechner's, and if both models fit the data equally
well, MDL will prefer Fechner's model.

Summarizing, refined MDL removes the arbitrary aspect of crude, two-part code MDL 
and associates parametric models with an inherent 'complexity' that does not depend 
on any particular description method for hypotheses. We should, however, warn the 
reader that we only discussed a special, simple situation in which we compared a finite 
number of parametric models that satisfy certain regularity conditions. Whenever the 
models do not satisfy these conditions, or if we compare an infinite number of models, 
then the refined ideas have to be extended. We then obtain a 'general' refined MDL 
Principle, which employs a combination of one-part and two-part codes. 

1.5 The MDL Philosophy 

The first central MDL idea is that every regularity in data may be used to compress 
that data; the second central idea is that learning can be equated with finding regu- 
larities in data. Whereas the first part is relatively straightforward, the second part of 
the idea implies that methods for learning from data must have a clear interpretation 
independent of whether any of the models under consideration is 'true' or not. Quoting
J. Rissanen [1989], the main originator of MDL:

"We never want to make the false assumption that the observed data actually 
were generated by a distribution of some kind, say Gaussian, and then go on 
to analyze the consequences and make further deductions. Our deductions may 
be entertaining but quite irrelevant to the task at hand, namely, to learn useful
properties from the data."

Jorma Rissanen [1989] 

Based on such ideas, Rissanen has developed a radical philosophy of learning and 
statistical inference that is considerably different from the ideas underlying mainstream 
statistics, both frequentist and Bayesian. We now describe this philosophy in more 
detail: 

1. Regularity as Compression. According to Rissanen, the goal of inductive
inference should be to 'squeeze out as much regularity as possible' from the given data.
The main task for statistical inference is to distill the meaningful information present
in the data, i.e. to separate structure (interpreted as the regularity, the 'meaningful
information') from noise (interpreted as the 'accidental information'). For the three
sequences of Section 1.2 this would amount to the following: the first sequence would
be considered as entirely regular and 'noiseless'. The second sequence would be
considered as entirely random: all information in the sequence is accidental, there is no
structure present. In the third sequence, the structural part would (roughly) be the
pattern that four times as many 0s as 1s occur; given this regularity, the description
of exactly which of all sequences with four times as many 0s as 1s occurs is the
accidental information.

2. Models as Languages. Rissanen interprets models (sets of hypotheses) as nothing
more than languages for describing useful properties of the data: a model ℋ is identified
with its corresponding universal code L(· | ℋ). Different individual hypotheses within
the models express different regularities in the data, and may simply be regarded as
statistics, that is, summaries of certain regularities in the data. These regularities are
present and meaningful independently of whether some H* ∈ ℋ is the 'true state of
nature' or not. Suppose that the model ℋ under consideration is probabilistic. In
traditional theories, one typically assumes that some P* ∈ ℋ generates the data, and
then 'noise' is defined as a random quantity relative to this P*. In the MDL view
'noise' is defined relative to the model ℋ as the residual number of bits needed to
encode the data once the model ℋ is given. Thus, noise is not a random variable: it
is a function only of the chosen model and the actually observed data. Indeed, there
is no place for a 'true distribution' or a 'true state of nature' in this view; there are
only models and data. To bring out the difference to the ordinary statistical viewpoint,
consider the phrase 'these experimental data are quite noisy'. According to a traditional
interpretation, such a statement means that the data were generated by a distribution
with high variance. According to the MDL philosophy, such a phrase means only that
the data are not compressible with the currently hypothesized model. As a matter of
principle, it can never be ruled out that there exists a different model under which the
data are very compressible (not noisy) after all!

3. We Have Only the Data. Many (but not all) other methods of inductive
inference are based on the idea that there exists some 'true state of nature', typically
a distribution assumed to lie in some model ℋ. The methods are then designed as a
means to identify or approximate this state of nature based on as little data as possible.
According to Rissanen, such methods are fundamentally flawed. The main reason is
that the methods are designed under the assumption that the true state of nature is
in the assumed model ℋ, which is often not the case. Therefore, such methods only
admit a clear interpretation under assumptions that are typically violated in practice.
Many cherished statistical methods are designed in this way: we mention hypothesis
testing, minimum-variance unbiased estimation, several non-parametric methods, and
even some forms of Bayesian inference; see Example 2.22. In contrast, MDL has a
clear interpretation which depends only on the data, and not on the assumption of any
underlying 'state of nature'.

Example 1.5 [Models that are Wrong, yet Useful] Even though the models 
under consideration are often wrong, they can nevertheless be very useful. Ex- 
amples are the successful 'Naive Bayes' model for spam filtering, Hidden Markov 
Models for speech recognition (is speech a stationary ergodic process? probably 
not), and the use of linear models in econometrics and psychology. Since these 
models are evidently wrong, it seems strange to base inferences on them using 
methods that are designed under the assumption that they contain the true distri- 
bution. To be fair, we should add that domains such as spam filtering and speech 
recognition are not what the fathers of modern statistics had in mind when they 
designed their procedures - they were usually thinking about much simpler
domains, where the assumption that some distribution P* ∈ M is 'true' may not be
so unreasonable.

4. MDL and Consistency Let M be a probabilistic model, such that each P ∈ M is
a probability distribution. Roughly, a statistical procedure is called consistent relative
to M if, for all P* ∈ M, the following holds: suppose data are distributed according
to P*. Then given enough data, the learning method will learn a good approximation
of P* with high probability. Many traditional statistical methods have been designed
with consistency in mind (Section 2.3).

The fact that in MDL, we do not assume a true distribution may suggest that we do 
not care about statistical consistency. But this is not the case: we would still like our 
statistical method to be such that in the idealized case where one of the distributions in 
one of the models under consideration actually generates the data, our method is able 
to identify this distribution, given enough data. If even in the idealized special case 
where a 'truth' exists within our models, the method fails to learn it, then we certainly 
cannot trust it to do something reasonable in the more general case, where there may 
not be a 'true distribution' underlying the data at all. So: consistency is important 
in the MDL philosophy, but it is used as a sanity check (for a method that has been 
developed without making distributional assumptions) rather than as a design principle. 

In fact, mere consistency is not sufficient. We would like our method to
converge to the imagined true P* fast, based on as small a sample as possible.
Two-part code MDL with 'clever' codes achieves good rates of convergence in this sense
(Barron and Cover [1991], complemented by Zhang [2004], show that in many
situations, the rates are minimax optimal). The same seems to be true for refined one-part
code MDL [Barron, Rissanen, and Yu 1998], although there is at least one surprising
exception where inference based on the NML and Bayesian universal model behaves
abnormally - see [Csiszár and Shields 2000] for the details.

Summarizing this section, the MDL philosophy is quite agnostic about whether any of
the models under consideration is 'true', or whether something like a 'true distribution'
even exists. Nevertheless, it has been suggested [Webb 1996; Domingos 1999] that
MDL embodies a naive belief that 'simple models' are 'a priori more likely to be true'
than complex models. Below we explain why such claims are mistaken.

1.6 MDL and Occam's Razor 

When two models fit the data equally well, MDL will choose the one that is the
'simplest' in the sense that it allows for a shorter description of the data. As such, it
implements a precise form of Occam's Razor - even though as more and more data becomes
available, the model selected by MDL may become more and more 'complex'! Occam's
Razor is sometimes criticized for being either (1) arbitrary or (2) false [Webb 1996;
Domingos 1999]. Do these criticisms apply to MDL as well?



'1. Occam's Razor (and MDL) is arbitrary' Because 'description length' is a 
syntactic notion it may seem that MDL selects an arbitrary model: different codes 
would have led to different description lengths, and therefore, to different models. By 
changing the encoding method, we can make 'complex' things 'simple' and vice versa. 
This overlooks the fact that we are not allowed to use just any code we like! 'Refined'
MDL tells us to use a specific code, independent of any specific parameterization of
the model, leading to a notion of complexity that can also be interpreted without any
reference to 'description lengths' (see also Section 2.10.1).

'2. Occam's Razor is false' It is often claimed that Occam's razor is false - we 
often try to model real-world situations that are arbitrarily complex, so why should we 
favor simple models? In the words of G. Webb⁸: 'What good are simple models of a
complex world?' 

The short answer is: even if the true data generating machinery is very complex, 
it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL 
(and the corresponding form of Occam's razor) is a strategy for inferring models from 
data ("choose simple models at small sample sizes"), not a statement about how the 
world works ("simple models are more likely to be true") - indeed, a strategy cannot 
be true or false, it is 'clever' or 'stupid'. And the strategy of preferring simpler models
is clever even if the data generating process is highly complex, as illustrated by the
following example: 

Example 1.6 ['Infinitely' Complex Sources] Suppose that data are subject to the
law Y = g(X) + Z where g is some continuous function and Z is some noise term
with mean 0. If g is not a polynomial, but X only takes values in a finite interval,
say [-1, 1], we may still approximate g arbitrarily well by taking higher and higher
degree polynomials. For example, let g(x) = exp(x). Then, if we use MDL to learn
a polynomial for data D = ((x_1, y_1), ..., (x_n, y_n)), the degree of the polynomial f^(n)
selected by MDL at sample size n will increase with n, and with high probability,
f^(n) converges to g(x) = exp(x) in the sense that max_{x ∈ [-1,1]} |f^(n)(x) - g(x)| → 0. Of
course, if we had better prior knowledge about the problem we could have tried to learn
g using a model class M containing the function y = exp(x). But in general, both our
imagination and our computational resources are limited, and we may be forced to use
imperfect models.

If, based on a small sample, we choose the best-fitting polynomial f within the set
of all polynomials, then, even though f will fit the data very well, it is likely to be
quite unrelated to the 'true' g, and f may lead to disastrous predictions of future
data. The reason is that, for small samples, the set of all polynomials is very large
compared to the set of possible data patterns that we might have observed. Therefore,
any particular data pattern can only give us very limited information about which
high-degree polynomial best approximates g. On the other hand, if we choose the
best-fitting f° in some much smaller set such as the set of second-degree polynomials,
then it is highly probable that the prediction quality (mean squared error) of f° on
future data is about the same as its mean squared error on the data we observed: the
size (complexity) of the contemplated model is relatively small compared to the set of
possible data patterns that we might have observed. Therefore, the particular pattern
that we do observe gives us a lot of information on what second-degree polynomial best
approximates g.

Thus, (a) f° typically leads to better predictions of future data than f; and (b)
unlike f, f° is reliable in that it gives a correct impression of how well it will
predict future data even if the 'true' g is 'infinitely' complex. This idea does not just
appear in MDL, but is also the basis of Vapnik's [1998] Structural Risk Minimization
approach and many standard statistical methods for non-parametric inference. In such
approaches one acknowledges that the data generating machinery can be infinitely
complex (e.g., not describable by a finite degree polynomial). Nevertheless, it is still a good
strategy to approximate it by simple hypotheses (low-degree polynomials) as long as
the sample size is small. Summarizing:
The Inherent Difference between Under- and Overfitting

If we choose an overly simple model for our data, then the best-fitting point
hypothesis within the model is likely to be almost the best predictor, within the
simple model, of future data coming from the same source. If we overfit (choose
a very complex model) and there is noise in our data, then, even if the complex
model contains the 'true' point hypothesis, the best-fitting point hypothesis within
the model is likely to lead to very bad predictions of future data coming from the
same source.



This statement is very imprecise and is meant more to convey the general idea than
to be completely true. As will become clear in Section 2.10.1, it becomes provably
true if we use MDL's measure of model complexity; we measure prediction quality by
logarithmic loss; and we assume that one of the distributions in M actually generates
the data.
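
The contrast between f and f° in Example 1.6 can be tried out numerically. The following is a minimal sketch under stated assumptions, not an MDL procedure: it fits polynomials by ordinary least squares, and the sample size (10), the noise level (standard deviation 0.5), the two degrees compared (2 and 9), the random seed and the helper names fit_polynomial, mean_squared_error and sample are all illustrative choices that do not come from the text.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations; returns coefficients c[0..degree]."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for j in range(col, k):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def mean_squared_error(coeffs, xs, ys):
    """Average squared difference between the polynomial and the observed y values."""
    def poly(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((poly(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(rng, xs, sigma=0.5):
    """Draw y = exp(x) + Gaussian noise for each x (sigma is an assumed noise level)."""
    return [math.exp(x) + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(0)
train_x = [-1.0 + 2.0 * i / 9 for i in range(10)]  # 10 equally spaced points in [-1, 1]
train_y = sample(rng, train_x)
test_x = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # fresh data from the same source
test_y = sample(rng, test_x)

results = {}
for degree in (2, 9):
    coeffs = fit_polynomial(train_x, train_y, degree)
    results[degree] = (
        mean_squared_error(coeffs, train_x, train_y),  # error on the observed sample
        mean_squared_error(coeffs, test_x, test_y),    # error on future data
    )
    print(degree, results[degree])
```

The degree-9 polynomial passes through all 10 training points, so its training error is essentially zero and says little about its error on fresh data. For the degree-2 fit, the two errors are typically much closer to each other. The exact numbers depend on the seed and on the assumed noise level.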

1.7 History 

The MDL Principle has mainly been developed by J. Rissanen in a series of papers
starting with [Rissanen 1978]. It has its roots in the theory of Kolmogorov or
algorithmic complexity [Li and Vitányi 1997], developed in the 1960s by Solomonoff [1964],
Kolmogorov [1965] and Chaitin [1966, 1969]. Among these authors, Solomonoff (a
former student of the famous philosopher of science, Rudolf Carnap) was explicitly
interested in inductive inference. The 1964 paper contains explicit suggestions on how the
underlying ideas could be made practical, thereby foreshadowing some of the later work
on two-part MDL. While Rissanen was not aware of Solomonoff's work at the time,
Kolmogorov's [1965] paper did serve as an inspiration for Rissanen's 1978 development
of MDL.

Another important inspiration for Rissanen was Akaike's [1973] AIC method for
model selection, essentially the first model selection method based on information- 
theoretic ideas. Even though Rissanen was inspired by AIC, both the actual method 
and the underlying philosophy are substantially different from MDL. 

MDL is much more closely related to the Minimum Message Length Principle,
developed by Wallace and his co-workers in a series of papers starting with the
groundbreaking [Wallace and Boulton 1968]; other milestones are [Wallace and Boulton 1975]
and [Wallace and Freeman 1987]. Remarkably, Wallace developed his ideas without
being aware of the notion of Kolmogorov complexity. Although Rissanen became aware of
Wallace's work before the publication of [Rissanen 1978], he developed his ideas mostly
independently, being influenced rather by Akaike and Kolmogorov. Indeed, despite
the close resemblance of both methods in practice, the underlying philosophy is quite
different - see Section 2.9.

The first publications on MDL only mention two-part codes. Important progress
was made by Rissanen [1984], in which prequential codes are employed for the first
time, and [Rissanen 1987], introducing the Bayesian mixture codes into MDL. This led
to the development of the notion of stochastic complexity as the shortest codelength
of the data given a model [Rissanen 1986; Rissanen 1987]. However, the connection to
Shtarkov's normalized maximum likelihood code was not made until 1996, and this
prevented the full development of the notion of 'parametric complexity'. In the meantime,
in his impressive Ph.D. thesis, Barron [1985] showed how a specific version of the
two-part code criterion has excellent frequentist statistical consistency properties. This was
extended by Barron and Cover [1991] who achieved a breakthrough for two-part codes:
they gave clear prescriptions on how to design codes for hypotheses, relating codes with
good minimax codelength properties to rates of convergence in statistical consistency
theorems. Some of the ideas of Rissanen [1987] and Barron and Cover [1991] were, as
it were, unified when Rissanen [1996] introduced a new definition of stochastic
complexity based on the normalized maximum likelihood code (Section 2.5). The resulting
theory was summarized for the first time by Barron, Rissanen, and Yu [1998], and is
called 'refined MDL' in the present overview.

1.8 Summary and Outlook 

We discussed how regularity is related to data compression, and how MDL employs this 
connection by viewing learning in terms of data compression. One can make this precise 
in several ways; in idealized MDL one looks for the shortest program that generates 
the given data. This approach is not feasible in practice, and here we concern ourselves 
with practical MDL. Practical MDL comes in a crude version based on two-part codes 
and in a modern, more refined version based on the concept of universal coding. The 
basic ideas underlying all these approaches can be found in the boxes spread throughout 
the text. 

These methods are mostly applied to model selection but can also be used for other 
problems of inductive inference. In contrast to most existing statistical methodology, 
they can be given a clear interpretation irrespective of whether or not there exists 
some 'true' distribution generating data - inductive inference is seen as a search for 
regular properties in (interesting statistics of) the data, and there is no need to assume 
anything outside the model and the data. In contrast to what is sometimes thought, 
there is no implicit belief that 'simpler models are more likely to be true' - MDL does 
embody a preference for 'simple' models, but this is best seen as a strategy for inference 
that can be useful even if the environment is not simple at all. 

In the next chapter, we make precise both the crude and the refined versions of 
practical MDL. For this, it is absolutely essential that the reader familiarizes him- or 
herself with two basic notions of coding and information theory: the relation between 
codelength functions and probability distributions, and (for refined MDL), the idea of 
universal coding - a large part of the chapter will be devoted to these. 






Notes 

1. See Section I21H1 Example 12331 

2. By this we mean that a universal Turing Machine can be implemented in it [Li and Vitányi 1997].

3. Strictly speaking, in our context it is not very accurate to speak of 'simple' or 'complex' polyno- 
mials; instead we should call the set of first degree polynomials 'simple', and the set of 100-th degree 
polynomials 'complex'. 

4. The terminology 'crude MDL' is not standard. It is introduced here for pedagogical reasons, 
to make clear the importance of having a single, unified principle for designing codes. It should 
be noted that Rissanen's and Barron's early theoretical papers on MDL already contain such
principles, albeit in a slightly different form than in their recent papers. Early practical applications
[Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are 'crude'
in the sense defined here.

5. See the previous note. 

6. For example, cross-validation cannot easily be interpreted in such terms of 'a method hunting for 
the true distribution'. 

7. The present author's own views are somewhat milder in this respect, but this is not the place to 
discuss them. 

8. Quoted with permission from KDD Nuggets 96:28, 1996. 
Chapter 2

Minimum Description Length 
Tutorial 

2.1 Plan of the Tutorial 

In Chapter 1 we introduced the MDL Principle in an informal way. In this chapter we
give an introduction to MDL that is mathematically precise. Throughout the text, we 
assume some basic familiarity with probability theory. While some prior exposure to 
basic statistics is highly useful, it is not required. The chapter can be read without 
any prior knowledge of information theory. The tutorial is organized according to the 
following plan: 

• The first two sections are of a preliminary nature:

- Any understanding of MDL requires some minimal knowledge of information
theory - in particular the relationship between probability distributions and
codes. This relationship is explained in Section 2.2.

- Relevant statistical notions such as 'maximum likelihood estimation' are
reviewed in Section 2.3. There we also introduce the Markov chain model
which will serve as an example model throughout the text.

• Based on this preliminary material, in Section 2.4 we formalize a simple version
of the MDL Principle, called the crude two-part MDL Principle in this text. We
explain why, for successful practical applications, crude MDL needs to be refined.

• Section 2.5 is once again preliminary: it discusses universal coding, the
information-theoretic concept underlying refined versions of MDL.

• Sections 2.6-2.8 define and discuss refined MDL. They are the key sections of the
tutorial:

- Section 2.6 discusses basic refined MDL for comparing a finite number of
simple statistical models and introduces the central concepts of parametric
and stochastic complexity. It gives an asymptotic expansion of these quantities
and interprets them from a compression, a geometric, a Bayesian and a
predictive point of view.

- Section 2.7 extends refined MDL to harder model selection problems, and
in doing so reveals the general, unifying idea, which is summarized in Fig- 
ure El 

- Section 2.8 briefly discusses how to extend MDL to applications beyond
model selection.

• Having defined 'refined MDL' in Sections 2.6-2.8, the next two sections place it
in context:

- Section 2.9 compares MDL to other approaches to inductive inference, most
notably the related but different Bayesian approach.

- Section 2.10 discusses perceived as well as real problems with MDL. The
perceived problems relate to MDL's relation to Occam's Razor, the real
problems relate to the fact that applications of MDL sometimes perform
suboptimally in practice.

• Finally, Section 2.11 provides a conclusion.



Reader's Guide 

Throughout the text, paragraph headings reflect the most important concepts. 
Boxes summarize the most important findings. Together, paragraph headings and 
boxes provide an overview of MDL theory. 

It is possible to read this chapter without having read the non-technical overview
of Chapter 1. However, we strongly recommend reading at least Sections 1.3 and
1.4 before embarking on the present chapter.



2.2 Information Theory I: Probabilities and Codelengths 

This first section is a mini-primer on information theory, focusing on the relationship
between probability distributions and codes. A good understanding of this relationship
is essential for a good understanding of MDL. After some preliminaries, Section 2.2.1
introduces prefix codes, the type of codes we work with in MDL. These are related to
probability distributions in two ways. In Section 2.2.2 we discuss the first relationship,
which is related to the Kraft inequality: for every probability mass function P, there
exists a code with lengths -log P, and vice versa. Section 2.2.3 discusses the second
relationship, related to the information inequality, which says that if the data are
distributed according to P, then the code with lengths -log P achieves the minimum
expected codelength. Throughout the section we give examples relating our findings to
our discussion of regularity and compression in Section 1.2 of Chapter 1.

Preliminaries and Notational Conventions - Codes We use log to denote
logarithm to base 2. For real-valued x we use ⌈x⌉ to denote the ceiling of x, that is,
x rounded up to the nearest integer. We often abbreviate x_1, ..., x_n to x^n. Let X
be a finite or countable set. A code for X is defined as a 1-to-1 mapping from X to
∪_{n≥1} {0,1}^n. ∪_{n≥1} {0,1}^n is the set of binary strings (sequences of 0s and 1s) of length
1 or larger. For a given code C, we use C(x) to denote the encoding of x. Every code
C induces a function L_C : X → N called the codelength function. L_C(x) is the number
of bits (symbols) needed to encode x using code C.

Our definition of code implies that we only consider lossless encoding in MDL¹: for
any description z it is always possible to retrieve the unique x that gave rise to it. More
precisely, because the code C must be 1-to-1, there is at most one x with C(x) = z.
Then x = C^-1(z), where the inverse C^-1 of C is sometimes called a 'decoding function'
or 'description method'.

Preliminaries and Notational Conventions - Probability Let P be a
probability distribution defined on a finite or countable set X. We use P(x) to denote the
probability of x, and we denote the corresponding random variable by X. If P is a
function on finite or countable X such that Σ_x P(x) < 1, we call P a defective
distribution. A defective distribution may be thought of as a probability distribution that
puts some of its mass on an imagined outcome that in reality will never appear.

A probabilistic source P is a sequence of probability distributions P^(1), P^(2), ... on
X^1, X^2, ... such that for all n, P^(n) and P^(n+1) are compatible: P^(n) is equal to the
'marginal' distribution of P^(n+1) restricted to n outcomes. That is, for all x^n ∈ X^n,
P^(n)(x^n) = Σ_{y ∈ X} P^(n+1)(x^n, y). Whenever this cannot cause any confusion, we write
P(x^n) rather than P^(n)(x^n). A probabilistic source may be thought of as a probability
distribution on infinite sequences². We say that the data are i.i.d. (independently and
identically distributed) under source P if for each n, x^n ∈ X^n, P(x^n) = Π_{i=1}^n P(x_i).
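
The compatibility condition can be checked mechanically for a concrete source. The sketch below assumes an i.i.d. source over the binary alphabet {0, 1} with P(1) = 0.7; the parameter value and the function names iid_prob and is_compatible are illustrative assumptions, not notation from the text.

```python
from itertools import product


def iid_prob(xs, theta=0.7):
    """P(x^n) = prod_i P(x_i) for a Bernoulli source; theta = P(1) is an arbitrary choice."""
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else 1.0 - theta
    return p


def is_compatible(prob, n, alphabet=(0, 1), tol=1e-12):
    """Check P^(n)(x^n) == sum over y of P^(n+1)(x^n, y) for every x^n of length n."""
    return all(
        abs(prob(xn) - sum(prob(xn + (y,)) for y in alphabet)) <= tol
        for xn in product(alphabet, repeat=n)
    )


# Marginalizing the last outcome of P^(4) recovers P^(3), and P^(4) sums to one.
print(is_compatible(iid_prob, 3))
print(sum(iid_prob(xn) for xn in product((0, 1), repeat=4)))
```

The same is_compatible check applies to any function that assigns probabilities to tuples of outcomes, so it can also be used to test sources that are not i.i.d.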

2.2.1 Prefix Codes 

In MDL we only work with a subset of all possible codes, the so-called prefix codes.
A prefix code³ is a code such that no codeword is a prefix of any other codeword.
For example, let X = {a, b, c}. Then the code C_1 defined by C_1(a) = 0, C_1(b) = 10,
C_1(c) = 11 is prefix. The code C_2 with C_2(a) = 0, C_2(b) = 10 and C_2(c) = 01, while
allowing for lossless decoding, is not a prefix code since 0 is a prefix of 01. The prefix
requirement is natural, and nearly ubiquitous in the data compression literature. We
now explain why.

Example 2.1 Suppose we plan to encode a sequence of symbols (x_1, ..., x_n) ∈ X^n.
We already designed a code C for the elements in X. The natural thing to do is to
encode (x_1, ..., x_n) by the concatenated string C(x_1)C(x_2)...C(x_n). In order for this
method to succeed for all n, all (x_1, ..., x_n) ∈ X^n, the resulting procedure must define
a code, i.e. the function C^(n) mapping (x_1, ..., x_n) to C(x_1)C(x_2)...C(x_n) must be
invertible. If it were not, we would have to use some marker such as a comma to
separate the codewords. We would then really be using a ternary rather than a binary
alphabet.

Since we always want to construct codes for sequences rather than single symbols,
we only allow codes C such that the extension C^(n) defines a code for all n. We say
that such codes have 'uniquely decodable extensions'. It is easy to see that (a) every
prefix code has uniquely decodable extensions. Conversely, although this is not at all
easy to see, it turns out that (b), for every code C with uniquely decodable extensions,
there exists a prefix code C_0 such that for all n, x^n ∈ X^n, L_{C^(n)}(x^n) = L_{C_0^(n)}(x^n)
[Cover and Thomas 1991]. Since in MDL we are only interested in code-lengths, and
never in actual codes, we can restrict ourselves to prefix codes without loss of generality.
Thus, the restriction to prefix codes may also be understood as a means to send
concatenated messages while avoiding the need to introduce extra symbols into the
alphabet.
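
The difference between the two example codes of this section can be made concrete. The following is a small sketch: the dictionaries C1 and C2 reproduce the codes C_1 and C_2 defined in the text, while the helper names is_prefix_code, encode and decode_prefix are illustrative assumptions.

```python
def is_prefix_code(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def encode(code, symbols):
    """Concatenate the codewords of the symbols, without any separator."""
    return "".join(code[s] for s in symbols)


def decode_prefix(code, bits):
    """Decode left to right; this greedy scheme is only valid for prefix codes."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete codeword has been read
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return decoded


C1 = {"a": "0", "b": "10", "c": "11"}  # prefix code from the text
C2 = {"a": "0", "b": "10", "c": "01"}  # not prefix: 0 is a prefix of 01

print(is_prefix_code(C1), is_prefix_code(C2))
print(decode_prefix(C1, encode(C1, "abcab")))
# Under C2 the sequences (a, b) and (c, a) yield the same concatenated string,
# so the extension of C2 to sequences is not invertible.
print(encode(C2, "ab"), encode(C2, "ca"))
```

With C1, the decoder recognizes the end of each codeword as soon as it has been read, which is why no separator symbol is needed.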

Whenever in the sequel we speak of 'code', we really mean 'prefix code'. We call a
prefix code C for a set X complete if there exists no other prefix code that compresses
at least one x more and no x less than C, i.e. if there exists no code C' such that for
all x, L_C'(x) ≤ L_C(x) with strict inequality for at least one x.

2.2.2 The Kraft Inequality - Codelengths and Probabilities, Part I 

In this subsection we relate prefix codes to probability distributions. Essential for 
understanding the relation is the fact that no matter what code we use, most sequences 
cannot be compressed, as demonstrated by the following example: 

Example 2.2 [Compression and Small Subsets: Example 1.2 Continued.]

In Example 1.2 we featured the following three sequences:

00010001000100010001 ... 0001000100010001000100010001 (2.1) 

01110100110100100110 ... 1010111010111011000101100010 (2.2) 

00011000001010100000 ... 0010001000010000001000110000 (2.3) 

We showed that (a) the first sequence - an n-fold repetition of 0001 - could be
substantially compressed if we use as our code a general-purpose programming language
(assuming that valid programs must end with a halt-statement or a closing bracket,
such codes satisfy the prefix property). We also claimed that (b) the second sequence, n
independent outcomes of fair coin tosses, cannot be compressed, and that (c) the third
sequence could be compressed to αn bits, with 0 < α < 1. We are now in a position
to prove statement (b): strings which are 'intuitively' random cannot be substantially
compressed. Let us take some arbitrary but fixed description method over the data
alphabet consisting of the set of all binary sequences of length n. Such a code maps
binary strings to binary strings. There are 2^n possible data sequences of length n.
Only two of these can be mapped to a description of length 1 (since there are only two
binary strings of length 1: '0' and '1'). Similarly, only a subset of at most 2^m sequences
can have a description of length m. This means that at most Σ_{i=1}^m 2^i < 2^(m+1) data
sequences can have a description length ≤ m. The fraction of data sequences of length
n that can be compressed by more than k bits is therefore at most 2^-k and as such
decreases exponentially in k. If data are generated by n tosses of a fair coin, then all 2^n
possibilities for the data are equally probable, so the probability that we can compress
the data by more than k bits is smaller than 2^-k. For example, the probability that
we can compress the data by more than 20 bits is smaller than one in a million.

We note that after the data (2.2) has been observed, it is always possible to design a
code which uses arbitrarily few bits to encode this data - the actually observed sequence 
may be encoded as '1' for example, and no other sequence is assigned a codeword. The 
point is that with a code that has been designed before seeing any data, it is virtually 
impossible to substantially compress randomly generated data. 

The example demonstrates that achieving a short description length for the data is 
equivalent to identifying the data as belonging to a tiny, very special subset out of all 
a priori possible data sequences. 
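
The counting argument of Example 2.2 can be reproduced with exact integer arithmetic. This is a sketch of the bound only; the function name max_fraction_compressible and the choice n = 30 are illustrative assumptions.

```python
from fractions import Fraction


def max_fraction_compressible(n, k):
    """Upper bound on the fraction of length-n binary strings that any fixed code
    can compress by more than k bits, i.e. to a description of length <= n - k - 1."""
    m = n - k - 1
    # There are 2^1 + 2^2 + ... + 2^m = 2^(m+1) - 2 binary strings of length 1..m,
    # so at most that many data sequences can receive such a short description.
    short_descriptions = sum(2 ** i for i in range(1, m + 1))
    return Fraction(short_descriptions, 2 ** n)


bound = max_fraction_compressible(30, 20)
print(bound, float(bound))
print(bound < Fraction(1, 2 ** 20), bound < Fraction(1, 10 ** 6))
```

For k = 20 the bound lies below 2^-20, which is itself smaller than one in a million, in line with the figure quoted in the example.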

A Most Important Observation Let Z be finite or countable. For concreteness,
we may take Z = {0,1}^n for some large n, say n = 10000. From Example 2.2 we know
that, no matter what code we use to encode values in Z, 'most' outcomes in Z will
not be substantially compressible: at most two outcomes can have description length
1 = -log 1/2; at most four outcomes can have length 2 = -log 1/4, and so on. Now
consider any probability distribution on Z. Since the probabilities P(z) must sum up to
one (Σ_z P(z) = 1), 'most' outcomes in Z must have small probability in the following
sense: at most 2 outcomes can have probability ≥ 1/2; at most 4 outcomes can have
probability ≥ 1/4; at most 8 can have probability ≥ 1/8, etc. This suggests an analogy between
codes and probability distributions: each code induces a code length function that
assigns a number to each z, where most z's are assigned large numbers. Similarly, each
distribution assigns a number to each z, where most z's are assigned small numbers.

It turns out that this correspondence can be made mathematically precise by means
of the Kraft inequality [Cover and Thomas 1991]. We neither precisely state nor prove
this inequality; rather, in Figure 2.1 we state an immediate and fundamental
consequence: probability mass functions correspond to codelength functions. The following
example illustrates this and at the same time introduces a type of code that will be
frequently employed in the sequel:

Example 2.3 [Uniform Distribution Corresponds to Fixed-length Code]
Suppose Z has M elements. The uniform distribution P_U assigns probabilities 1/M to each
element. We can arrive at a code corresponding to P_U as follows. First, order and
number the elements in Z as 0, 1, ..., M-1. Then, for each z with number j, set C(z) to be
equal to j represented as a binary number with ⌈log M⌉ bits. The resulting code has,
for all z ∈ Z, L_C(z) = ⌈log M⌉ = ⌈-log P_U(z)⌉. This is a code corresponding to P_U
(Figure 2.1). In general, there exist several codes corresponding to P_U, one for each
ordering of Z. But all these codes share the same length function L_U(z) := ⌈-log P_U(z)⌉;
therefore, L_U(z) is the unique codelength function corresponding to P_U.

For example, if M = 4, Z = {a, b, c, d}, we can take C(a) = 00, C(b) = 01, C(c) =
10, C(d) = 11 and then L_U(z) = 2 for all z ∈ Z. In general, codes corresponding to
uniform distributions assign fixed lengths to each z and are called fixed-length codes.
To map a non-uniform distribution to a corresponding code, we have to use a more
intricate construction [Cover and Thomas 1991].

Probability Mass Functions correspond to Codelength Functions

Let Z be a finite or countable set and let P be a probability distribution on Z. Then
there exists a prefix code C for Z such that for all z ∈ Z, L_C(z) = ⌈-log P(z)⌉.
C is called the code corresponding to P.

Similarly, let C' be a prefix code for Z. Then there exists a (possibly defective)
probability distribution P' such that for all z ∈ Z, -log P'(z) = L_C'(z). P' is
called the probability distribution corresponding to C'.

Moreover C' is a complete prefix code iff P' is proper (Σ_z P'(z) = 1).

Thus, large probability according to P means small code length according to the
code corresponding to P and vice versa.

We are typically concerned with cases where Z represents sequences of n outcomes;
that is, Z = X^n (n ≥ 1) where X is the sample space for one observation.

Figure 2.1: The most important observation of this tutorial.
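
The correspondence between distributions and codelengths can be illustrated with a few lines of code. This is a minimal sketch: the function names codelengths, kraft_sum and fixed_length_code, the example distribution, and the convention of using at least one bit per codeword are assumptions made for illustration. The sketch computes only the lengths ⌈-log P(z)⌉ for a non-uniform distribution and does not construct the corresponding codewords.

```python
import math


def codelengths(p):
    """L(z) = ceil(-log2 P(z)) for every outcome z with P(z) > 0."""
    return {z: math.ceil(-math.log2(q)) for z, q in p.items() if q > 0}


def kraft_sum(lengths):
    """Sum over z of 2^(-L(z)); a prefix code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -length for length in lengths.values())


def fixed_length_code(outcomes):
    """Number the outcomes 0..M-1 and write each number with ceil(log2 M) bits."""
    outcomes = list(outcomes)
    width = max(1, math.ceil(math.log2(len(outcomes))))  # at least one bit per codeword
    return {z: format(j, "0{}b".format(width)) for j, z in enumerate(outcomes)}


# Uniform distribution on four outcomes, as in Example 2.3.
uniform = {z: 0.25 for z in "abcd"}
print(fixed_length_code("abcd"))
print(codelengths(uniform), kraft_sum(codelengths(uniform)))

# A non-uniform distribution: rounding up leaves some of the Kraft budget unused,
# so the corresponding probability distribution 2^(-L(z)) is defective.
skewed = {"a": 0.5, "b": 0.3, "c": 0.2}
print(codelengths(skewed), kraft_sum(codelengths(skewed)))
```

For the uniform distribution the Kraft sum equals 1, matching the fact that the fixed-length code on four outcomes is complete. For the skewed distribution the sum is strictly below 1 because of rounding.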

In practical applications, we almost always deal with probability distributions P and
strings x^n such that P(x^n) decreases exponentially in n; for example, this will typically
be the case if data are i.i.d., such that P(x^n) = Π_i P(x_i). Then -log P(x^n) increases
linearly in n and the effect of rounding off -log P(x^n) becomes negligible. Note that
the code corresponding to the product distribution of P on X^n does not have to be the
n-fold extension of the code for the original distribution P on X - if we were to require
that, the effect of rounding off would be on the order of n. Instead, we directly design a
code for the distribution on the larger space Z = X^n. In this way, the effect of rounding
changes the codelength by at most 1 bit, which is truly negligible. For this and other⁴
reasons, we henceforth simply neglect the integer requirement for codelengths. This
simplification allows us to identify codelength functions and (defective) probability
mass functions, such that a short codelength corresponds to a high probability and
vice versa. Furthermore, as we will see, in MDL we are not interested in the details of
actual encodings C(z); we only care about the code lengths L_C(z). It is so useful to
think about these as log-probabilities, and so convenient to allow for non-integer
code lengths, that we will simply redefine prefix code length functions as (defective)
probability mass functions that can have non-integer code lengths - see Figure 2.2.

New Definition of Code Length Function

In MDL we are NEVER concerned with actual encodings; we are only concerned
with code length functions. The set of all codelength functions for finite or
countable sample space Z is defined as:

ℒ_Z = { L : Z → [0, ∞] | Σ_{z ∈ Z} 2^-L(z) ≤ 1 },    (2.4)

or equivalently, ℒ_Z is the set of those functions L on Z such that there exists
a function Q with Σ_z Q(z) ≤ 1 and for all z, L(z) = -log Q(z). (Q(z) = 0
corresponds to L(z) = ∞).

Again, Z usually represents a sample of n outcomes: Z = X^n (n ≥ 1) where X is
the sample space for one observation.

Figure 2.2: Code lengths are probabilities.

The following example illustrates idealized codelength functions and at the same time
introduces a type of code that will be frequently used in the sequel:

Example 2.4 'Almost' Uniform Code for the Positive Integers Suppose we
want to encode a number $k \in \{1, 2, \ldots\}$. In Example 2.3 we saw that in order to
encode a number between 1 and $M$, we need $\log M$ bits. What if we cannot determine
the maximum $M$ in advance? We cannot just encode $k$ using the uniform code for
$\{1, \ldots, k\}$, since the resulting code would not be prefix. So in general, we will need
more than $\log k$ bits. Yet there exists a prefix-free code which performs 'almost' as well
as $\log k$. The simplest of such codes works as follows. $k$ is described by a codeword
starting with $\lceil \log k \rceil$ 0s. This is followed by a 1, and then $k$ is encoded using the
uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$. With this protocol, a decoder can first reconstruct
$\lceil \log k \rceil$ by counting all 0s before the leftmost 1 in the encoding. He then has an upper
bound on $k$ and can use this knowledge to decode $k$ itself. This protocol uses at most
$2 \lceil \log k \rceil + 1$ bits. Working with idealized, non-integer code-lengths we can simplify
this to $2 \log k + 1$ bits. To see this, consider the function $P(x) = 2^{-2 \log x - 1}$. An easy
calculation gives

$$\sum_{x \in \{1,2,\ldots\}} P(x) = \sum_{x \in \{1,2,\ldots\}} 2^{-2\log x - 1} = \frac{1}{2} \sum_{x \in \{1,2,\ldots\}} x^{-2} < \frac{1}{2} + \frac{1}{2} \sum_{x = 2,3,\ldots} \frac{1}{x(x-1)} = 1,$$

so that $P$ is a (defective) probability distribution. Thus, by our new definition
(Figure 2.2), there exists a prefix code with, for all $k$, $L(k) = -\log P(k) = 2 \log k + 1$. We
call the resulting code the 'simple standard code for the integers'. In Section 2.5 we
will see that it is an instance of a so-called 'universal' code.

The idea can be refined to lead to codes with lengths $\log k + O(\log \log k)$; the 'best'
possible refinement, with code lengths $L(k)$ increasing monotonically but as slowly as
possible in $k$, is known as 'the universal code for the integers' [Rissanen 1983]. However,
for our purposes in this tutorial, it is good enough to encode integers $k$ with $2 \log k + 1$
bits.
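As a concrete illustration of the integer-valued protocol described in this example, here is a minimal Python sketch. The function names `encode_int` and `decode_int` are ours, and the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$ is realized (one possible choice) by writing $k - 1$ in binary.

```python
def encode_int(k):
    """Codeword for an integer k >= 1: ceil(log2 k) zeros, a one, then k - 1 in that many bits."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    m = (k - 1).bit_length()                     # ceil(log2 k)
    body = format(k - 1, f"0{m}b") if m else ""  # uniform code for {1, ..., 2**m}
    return "0" * m + "1" + body


def decode_int(bits):
    """Decode one integer from the front of `bits`; return (k, number of bits consumed)."""
    m = bits.index("1")                          # count the 0s before the leftmost 1
    body = bits[m + 1:2 * m + 1]
    k = int(body, 2) + 1 if m else 1
    return k, 2 * m + 1


stream = encode_int(5) + encode_int(2)           # two codewords back to back
first, used = decode_int(stream)
second, _ = decode_int(stream[used:])
print(first, second)
```

Because the code is prefix-free, the decoder knows where each codeword ends, so codewords can be concatenated without separators. Each codeword has exactly $2 \lceil \log k \rceil + 1$ bits.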

Example 2.5 [Examples 1.2 and 2.2, Continued.] We are now also in a position
to prove the third and final claim of Examples 1.2 and 2.2. Consider the three
sequences (2.1), (2.2) and (2.3) again. It remains to investigate how
much the third sequence can be compressed. Assume for concreteness that, before
seeing the sequence, we are told that the sequence contains a fraction of 1s equal
to $1/5 + \epsilon$ for some small unknown $\epsilon$. By the Kraft inequality (Figure 2.1), for all
distributions $P$, there exists some code on sequences of length $n$ such that for all
$x^n \in \mathcal{X}^n$, $L(x^n) = \lceil -\log P(x^n) \rceil$. The fact that the fraction of 1s is approximately
equal to 1/5 suggests modeling $x^n$ as independent outcomes of a coin with bias
1/5-th. The corresponding distribution $P_0$ satisfies

$$-\log P_0(x^n) = -\log\left[\left(\tfrac{1}{5}\right)^{n_{[1]}} \left(\tfrac{4}{5}\right)^{n_{[0]}}\right] = n\left[-\left(\tfrac{1}{5} + \epsilon\right)\log\tfrac{1}{5} - \left(\tfrac{4}{5} - \epsilon\right)\log\tfrac{4}{5}\right] = n\left[\log 5 - \tfrac{8}{5} + 2\epsilon\right],$$

where $n_{[j]}$ denotes the number of occurrences of symbol $j$ in $x^n$. For small enough $\epsilon$,
the part between brackets is smaller than 1, so that, using the code $L_0$ with lengths
$-\log P_0$, the sequence can be encoded using $\alpha n$ bits where $\alpha$ satisfies $0 < \alpha < 1$.
Thus, using the code $L_0$, the sequence can be compressed by a linear amount, if
we use a specially designed code that assigns short codelengths to sequences with
about four times as many 0s as 1s.

We note that after the data (2.3) has been observed, it is always possible to design
a code which uses arbitrarily few bits to encode $x^n$ - the actually observed sequence
may be encoded as '1' for example, and no other sequence is assigned a codeword.
The point is that with a code that has been designed before seeing the actual
sequence, given only the knowledge that the sequence will contain approximately
four times as many 0s as 1s, the sequence is guaranteed to be compressed by an
amount linear in $n$.
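The codelength calculation in this example can be checked numerically. The sketch below is only an illustration: the sequence (1000 symbols with a fraction $1/5 + \epsilon$ of 1s, $\epsilon = 0.01$) is a made-up stand-in for sequence (2.3), and logarithms are taken to base 2 so that lengths are in bits.

```python
from math import isclose, log2


def codelength_bias_one_fifth(x):
    """-log2 P_0(x) for a 0/1 sequence x under the i.i.d. model with P_0(1) = 1/5."""
    n1 = sum(x)
    n0 = len(x) - n1
    return -(n1 * log2(1 / 5) + n0 * log2(4 / 5))


# Hypothetical data: n = 1000 symbols, a fraction 1/5 + eps of them equal to 1.
n, eps = 1000, 0.01
x = [1] * 210 + [0] * 790

bits = codelength_bias_one_fifth(x)
closed_form = n * (log2(5) - 8 / 5 + 2 * eps)   # n [log 5 - 8/5 + 2 eps]
print(bits, closed_form, bits / n)
```

Only the counts of 0s and 1s matter, not their order. For this choice of $\epsilon$ the ratio `bits / n` is roughly 0.74, so the sequence needs about 742 bits rather than the 1000 bits of the literal description.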

Continuous Sample Spaces How does the correspondence work for continuous-valued
$\mathcal{X}$? In this tutorial we only consider $P$ on $\mathcal{X}$ such that $P$ admits a density$^{5}$.
Whenever in the following we make a general statement about sample spaces $\mathcal{X}$ and
distributions $P$, $\mathcal{X}$ may be finite, countable or any subset of $\mathbb{R}^l$, for any integer $l \geq 1$,
and $P(x)$ represents the probability mass function or density of $P$, as the case may
be. In the continuous case, all sums should be read as integrals. The correspondence
between probability distributions and codes may be extended to distributions on
continuous-valued $\mathcal{X}$: we may think of $L(x^n) := -\log P(x^n)$ as a code-length function
corresponding to $\mathcal{Z} = \mathcal{X}^n$ encoding the values in $\mathcal{X}^n$ at unit precision; here $P(x^n)$ is
the density of $x^n$ according to $P$. We refer to [Cover and Thomas 1991] for further
details.

The P that corresponds to L minimizes expected codelength

Let $P$ be a distribution on (finite, countable or continuous-valued) $\mathcal{Z}$ and let $L$ be
defined by

$$L := \arg\min_{L \in \mathcal{L}_{\mathcal{Z}}} E_P[L(Z)]. \tag{2.5}$$

Then $L$ exists, is unique and is identical to the codelength function corresponding
to $P$, with lengths $L(z) = -\log P(z)$.

Figure 2.3: The second most important observation of this tutorial.

2.2.3 The Information Inequality - Codelengths & Probabilities, II 

In the previous subsection, we established the first fundamental relation between prob- 
ability distributions and codelength functions. We now discuss the second relation, 
which is nearly as important. 

In the correspondence to codelength functions, probability distributions were treated
as mathematical objects and nothing else. That is, if we decide to use a code $C$ to
encode our data, this does not necessarily mean that we assume our data to be
drawn according to the probability distribution corresponding to $C$: we may have no
idea what distribution generates our data; or conceivably, such a distribution may not
even exist$^{6}$. Nevertheless, if the data are distributed according to some distribution
$P$, then the code corresponding to $P$ turns out to be the optimal code to use, in an
expected sense - see Figure 2.3. This result may be recast as follows: for all distributions
$P$ and $Q$ with $Q \neq P$,

$$E_P[-\log Q(X)] > E_P[-\log P(X)].$$

In this form, the result is known as the information inequality. It is easily proved using
concavity of the logarithm [Cover and Thomas 1991].

The information inequality says the following: suppose $Z$ is distributed according
to $P$ ('generated by $P$'). Then, among all possible codes for $Z$, the code with lengths
$-\log P(Z)$ 'on average' gives the shortest encodings of outcomes of $P$. Why should
we be interested in the average? The law of large numbers [Feller 1968] implies that,
for large samples of data distributed according to $P$, with high $P$-probability, the code
that gives the shortest expected lengths will also give the shortest actual codelengths,
which is what we are really interested in. This will hold if data are i.i.d., but also more
generally if $P$ defines a 'stationary and ergodic' process.
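The information inequality is easy to check numerically on a small example. The following Python sketch is an illustration only: the distributions `P` and `Q` are made-up values on a three-symbol alphabet, and `Q` is assumed to be strictly positive wherever `P` is.

```python
from math import log2


def expected_codelength(p, q):
    """E_P[-log2 Q(X)] for distributions given as dicts mapping outcome -> probability."""
    return sum(px * -log2(q[x]) for x, px in p.items() if px > 0)


P = {"a": 0.5, "b": 0.25, "c": 0.25}       # hypothetical data-generating distribution
Q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}   # hypothetical mismatched distribution

matched = expected_codelength(P, P)        # code corresponding to P itself
mismatched = expected_codelength(P, Q)     # code corresponding to Q, data from P
print(matched, mismatched)
```

Here the matched code needs 1.5 bits per outcome on average, while the uniform code needs $\log 3 \approx 1.585$ bits, in line with the inequality.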

Example 2.6 Let us briefly illustrate this. Let $P^*$, $Q_A$ and $Q_B$ be three probability
distributions on $\mathcal{X}$, extended to $\mathcal{Z} = \mathcal{X}^n$ by independence. Hence
$P^*(x^n) = \prod_{i=1}^n P^*(x_i)$ and similarly for $Q_A$ and $Q_B$. Suppose we obtain a sample generated by
$P^*$. Mr. A and Mrs. B both want to encode the sample using as few bits as possible,
but neither knows that $P^*$ has actually been used to generate the sample. A decides
to use the code corresponding to distribution $Q_A$ and B decides to use the code
corresponding to $Q_B$. Suppose that $E_{P^*}[-\log Q_A(X)] < E_{P^*}[-\log Q_B(X)]$. Then,
by the law of large numbers, with $P^*$-probability 1,
$n^{-1}[-\log Q_j(X_1, \ldots, X_n)] \to E_{P^*}[-\log Q_j(X)]$ for both $j \in \{A, B\}$
(note $-\log Q_j(X^n) = -\sum_{i=1}^n \log Q_j(X_i)$).
It follows that, with probability 1, Mr. A will need fewer (linearly in $n$) bits to
encode $X_1, \ldots, X_n$ than Mrs. B.

The qualitative content of this result is not so surprising: in a large sample generated
by $P$, the frequency of each $x \in \mathcal{X}$ will be approximately equal to the probability
$P(x)$. In order to obtain a short codelength for $x^n$, we should use a code that assigns a
small codelength to those symbols in $\mathcal{X}$ with high frequency (probability), and a large
codelength to those symbols in $\mathcal{X}$ with low frequency (probability).

Summary and Outlook In this section we introduced (prefix) codes and thoroughly 
discussed the relation between probabilities and codelengths. We are now almost ready 
to formalize a simple version of MDL - but first we need to review some concepts of 
statistics. 

2.3 Statistical Preliminaries and Example Models 

In the next section we will make precise the crude form of MDL informally presented
in Section 1.3. We will freely use some convenient statistical concepts which we review
in this section; for details see, for example, [Casella and Berger 1990]. We also describe
the model class of Markov chains of arbitrary order, which we use as our running
example. These admit a simpler treatment than the polynomials, to which we return
later.



Statistical Preliminaries A probabilistic model$^{7}$ $\mathcal{M}$ is a set of probabilistic sources.
Typically one uses the word 'model' to denote sources of the same functional form. We
often index the elements $P$ of a model $\mathcal{M}$ using some parameter $\theta$. In that case we
write $P$ as $P(\cdot \mid \theta)$, and $\mathcal{M}$ as $\mathcal{M} = \{P(\cdot \mid \theta) \mid \theta \in \Theta\}$, for some parameter space $\Theta$. If
$\mathcal{M}$ can be parameterized by some connected $\Theta \subseteq \mathbb{R}^k$ for some $k \geq 1$ and the mapping
$\theta \to P(\cdot \mid \theta)$ is smooth (appropriately defined), we call $\mathcal{M}$ a parametric model or family.
For example, the model $\mathcal{M}$ of all normal distributions on $\mathcal{X} = \mathbb{R}$ is a parametric model
that can be parameterized by $\theta = (\mu, \sigma^2)$ where $\mu$ is the mean and $\sigma^2$ is the variance of
the distribution indexed by $\theta$. The family of all Markov chains of all orders is a model,
but not a parametric model. We call a model $\mathcal{M}$ an i.i.d. model if, according to all
$P \in \mathcal{M}$, $X_1, X_2, \ldots$ are i.i.d. We call $\mathcal{M}$ $k$-dimensional if $k$ is the smallest integer $k$ so
that $\mathcal{M}$ can be smoothly parameterized by some $\Theta \subseteq \mathbb{R}^k$.

For a given model $\mathcal{M}$ and sample $D = x^n$, the maximum likelihood (ML) $\hat{P}$ is
the $P \in \mathcal{M}$ maximizing $P(x^n)$. For a parametric model with parameter space $\Theta$,
the maximum likelihood estimator $\hat{\theta}$ is the function that, for each $n$, maps $x^n$ to the
$\theta \in \Theta$ that maximizes the likelihood $P(x^n \mid \theta)$. The ML estimator may be viewed as
a 'learning algorithm'. This is a procedure that, when input a sample $x^n$ of arbitrary
length, outputs a parameter or hypothesis $P_n \in \mathcal{M}$. We say a learning algorithm is
consistent relative to distance measure $d$ if for all $P^* \in \mathcal{M}$, if data are distributed
according to $P^*$, then the output $P_n$ converges to $P^*$ in the sense that $d(P^*, P_n) \to 0$
with $P^*$-probability 1. Thus, if $P^*$ is the 'true' state of nature, then given enough data,
the learning algorithm will learn a good approximation of $P^*$ with very high probability.



Example 2.7 [Markov and Bernoulli models] Recall that a $k$-th order Markov
chain on $\mathcal{X} = \{0, 1\}$ is a probabilistic source such that for every $n > k$,

$$P(X_n = 1 \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k}) = P(X_n = 1 \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k}, \ldots, X_1 = x_1). \tag{2.6}$$

That is, the probability distribution on $X_n$ depends only on the $k$ symbols preceding $n$.
Thus, there are $2^k$ possible distributions of $X_n$, and each such distribution is identified
with a state of the Markov chain. To fully identify the chain, we also need to specify the
starting state, defining the first $k$ outcomes $X_1, \ldots, X_k$. The $k$-th order Markov model
is the set of all $k$-th order Markov chains, i.e. all sources satisfying (2.6) equipped with
a starting state.

The special case of the 0-th order Markov model is the Bernoulli or biased coin
model, which we denote by $\mathcal{B}^{(0)}$. We can parameterize the Bernoulli model by a
parameter $\theta \in [0, 1]$ representing the probability of observing a 1. Thus,
$\mathcal{B}^{(0)} = \{P(\cdot \mid \theta) \mid \theta \in [0, 1]\}$, with $P(x^n \mid \theta)$ by definition equal to

$$P(x^n \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta) = \theta^{n_{[1]}} (1 - \theta)^{n_{[0]}},$$

where $n_{[1]}$ stands for the number of 1s, and $n_{[0]}$ for the number of 0s in the sample.
Note that the Bernoulli model is i.i.d. The log-likelihood is given by

$$\log P(x^n \mid \theta) = n_{[1]} \log \theta + n_{[0]} \log(1 - \theta). \tag{2.7}$$

Taking the derivative of (2.7) with respect to $\theta$, we see that for fixed $x^n$, the
log-likelihood is maximized by setting the probability of 1 equal to the observed frequency.
Since the logarithm is a monotonically increasing function, the likelihood is maximized
at the same value: the ML estimator is given by $\hat{\theta}(x^n) = n_{[1]}/n$.

Similarly, the first-order Markov model $\mathcal{B}^{(1)}$ can be parameterized by a vector
$\theta = (\theta_{[1|0]}, \theta_{[1|1]}) \in [0, 1]^2$ together with a starting state in $\{0, 1\}$. Here $\theta_{[1|j]}$ represents the
probability of observing a 1 following the symbol $j$. The log-likelihood is given by

$$\log P(x^n \mid \theta) = n_{[1|1]} \log \theta_{[1|1]} + n_{[0|1]} \log(1 - \theta_{[1|1]}) + n_{[1|0]} \log \theta_{[1|0]} + n_{[0|0]} \log(1 - \theta_{[1|0]}),$$

$n_{[i|j]}$ denoting the number of times outcome $i$ is observed in state (previous outcome)
$j$. This is maximized by setting $\hat{\theta} = (\hat{\theta}_{[1|0]}, \hat{\theta}_{[1|1]})$, with $\hat{\theta}_{[i|j]} = n_{[i|j]} / (n_{[0|j]} + n_{[1|j]})$ set to
the conditional frequency of $i$ preceded by $j$. In general, a $k$-th order Markov chain has
$2^k$ parameters and the corresponding likelihood is maximized by setting the parameter
$\theta_{[i|j]}$ equal to the number of times $i$ was observed in state $j$ divided by the number of
times the chain was in state $j$.
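The counting rule for the ML estimator can be written out directly. The following Python sketch is one possible reading, not a definitive implementation: the function name `markov_ml` is ours, the first $k$ symbols are treated as the given starting state (so only $x_{k+1}, \ldots, x_n$ contribute to the likelihood), and logarithms are to base 2.

```python
from collections import Counter
from math import log2


def markov_ml(x, k):
    """ML estimate of a k-th order binary Markov chain, conditioning on the first k symbols.

    Returns (theta, loglik): theta[state] estimates the probability of a 1 after `state`
    (a tuple of the k preceding symbols); loglik is the maximized log2-likelihood of x[k:].
    """
    counts = Counter()
    for i in range(k, len(x)):
        counts[(tuple(x[i - k:i]), x[i])] += 1
    states = {state for (state, _) in counts}
    theta, loglik = {}, 0.0
    for s in states:
        n1, n0 = counts[(s, 1)], counts[(s, 0)]
        theta[s] = n1 / (n1 + n0)            # conditional frequency of a 1 in state s
        for c, p in ((n1, theta[s]), (n0, 1 - theta[s])):
            if c:                            # a count of zero contributes nothing
                loglik += c * log2(p)
    return theta, loglik


x = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
for k in (0, 1, 2):
    print(k, markov_ml(x, k))
```

For $k = 0$ the single state is the empty tuple and the estimate reduces to the Bernoulli frequency $n_{[1]}/n$. States that never occur in the sample get no estimate in this sketch.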

Suppose now we are given data $D = x^n$ and we want to find the Markov chain that
best explains $D$. Since we do not want to restrict ourselves to chains of fixed order, we
run a large risk of overfitting: simply picking, among all Markov chains of each order,
the ML Markov chain that maximizes the probability of the data, we typically end up
with a chain of order $n - 1$ with starting state given by the sequence $x_1, \ldots, x_{n-1}$, and
$P(X_n = x_n \mid X^{n-1} = x^{n-1}) = 1$. Such a chain will assign probability 1 to $x^n$. Below
we show that MDL makes a more reasonable choice.



2.4 Crude MDL 

Based on the information-theoretic (Section 2.2) and statistical (Section 2.3)
preliminaries discussed before, we now formalize a first, crude version of MDL.

Let $\mathcal{M}$ be a class of probabilistic sources (not necessarily Markov chains). Suppose
we observe a sample $D = (x_1, \ldots, x_n) \in \mathcal{X}^n$. Recall 'the crude$^{8}$ two-part code MDL
Principle' from Section 1.3:



Crude$^{9}$, Two-part Version of MDL Principle

Let $\mathcal{H}^{(1)}, \mathcal{H}^{(2)}, \ldots$ be a set of candidate models. The best point hypothesis
$H \in \mathcal{H}^{(1)} \cup \mathcal{H}^{(2)} \cup \ldots$ to explain data $D$ is the one which minimizes the sum
$L(H) + L(D|H)$, where

• $L(H)$ is the length, in bits, of the description of the hypothesis; and

• $L(D|H)$ is the length, in bits, of the description of the data when encoded
with the help of the hypothesis.

The best model to explain $D$ is the smallest model containing the selected $H$.

In this section, we implement this crude MDL Principle by giving a precise definition
of the terms $L(H)$ and $L(D|H)$. To make the first term precise, we must design a
code $C_1$ for encoding hypotheses $H$ such that $L(H) = L_{C_1}(H)$. For the second term,
we must design a set of codes $C_{2,H}$ (one for each $H \in \mathcal{M}$) such that for all $D \in \mathcal{X}^n$,
$L(D|H) = L_{C_{2,H}}(D)$. We start by describing the codes $C_{2,H}$.

2.4.1 Description Length of Data given Hypothesis 

Given a sample of size $n$, each hypothesis $H$ may be viewed as a probability distribution
on $\mathcal{X}^n$. We denote the corresponding probability mass function by $P(\cdot \mid H)$. We need
to associate with $P(\cdot \mid H)$ a code, or really, just a codelength function for $\mathcal{X}^n$. We
already know that there exists a code with length function $L$ such that for all $x^n \in \mathcal{X}^n$,
$L(x^n) = -\log P(x^n \mid H)$. This is the code that we will pick. It is a natural choice for
two reasons:

1. With this choice, the code length $L(x^n \mid H)$ is equal to minus the log-likelihood
of $x^n$ according to $H$, which is a standard statistical notion of 'goodness-of-fit'.

2. If the data turn out to be distributed according to $P(\cdot \mid H)$, then the code $L(\cdot \mid H)$ will
uniquely minimize the expected code length (Section 2.2).

The second item implies that our choice is, in a sense, the only reasonable choice$^{10}$.
To see this, suppose $\mathcal{M}$ is a finite i.i.d. model containing, say, $M$ distributions.
Suppose we assign an arbitrary but finite code length $L(H)$ to each $H \in \mathcal{M}$. Suppose
$X_1, X_2, \ldots$ are actually distributed i.i.d. according to some 'true' $H^* \in \mathcal{M}$.
By the reasoning of Example 2.6, we see that MDL will select the true distribution
$P(\cdot \mid H^*)$ for all large $n$, with probability 1. This means that MDL is consistent
for finite $\mathcal{M}$. If we were to assign codes to distributions in some other manner
not satisfying $L(D \mid H) = -\log P(D \mid H)$, then there would exist distributions
$P(\cdot \mid H)$ such that $L(D|H) \neq -\log P(D|H)$. But by Figure 2.1 there must be
some distribution $P(\cdot \mid H')$ with $L(\cdot|H) = -\log P(\cdot \mid H')$. Now let $\mathcal{M} = \{H, H'\}$
and suppose data are distributed according to $P(\cdot \mid H')$. Then, by the reasoning
of Example 2.6, MDL would select $H$ rather than $H'$ for all large $n$! Thus, MDL
would be inconsistent even in this simplest of all imaginable cases - there would
then be no hope for good performance in the considerably more complex situations
we intend to use it for$^{11}$.

2.4.2 Description Length of Hypothesis 

In its weakest and crudest form, the two-part code MDL Principle does not give any 
guidelines as to how to encode hypotheses (probability distributions). Every code for 
encoding hypotheses is allowed, as long as such a code does not change with the sample 
size n. 

To see the danger in allowing codes to depend on $n$, consider the Markov chain
example: if we were allowed to use different codes for different $n$, we could use, for
each $n$, a code assigning a uniform distribution to all Markov chains of order $n - 1$
with all parameters equal to 0 or 1. Since there are only a finite number ($2^{n-1}$) of
these, this is possible. But then, for each $n$, $x^n \in \mathcal{X}^n$, MDL would select the ML
Markov chain of order $n - 1$. Thus, MDL would coincide with ML and, no matter
how large $n$, we would overfit.

Consistency of Two-part MDL Remarkably, if we fix an arbitrary code for all
hypotheses, identical for all sample sizes $n$, this is sufficient to make MDL consistent$^{12}$
for a wide variety of models, including the Markov chains. For example, let $L$ be
the length function corresponding to some code for the Markov chains. Suppose some
Markov chain $P^*$ generates the data such that $L(P^*) < \infty$ under our coding scheme.
Then, broadly speaking, for every $P^*$ of every order, with probability 1 there exists
some $n_0$ such that for all samples larger than $n_0$, two-part MDL will select $P^*$ - here
$n_0$ may depend on $P^*$ and $L$.

While this result indicates that MDL may be doing something sensible, it certainly
does not justify the use of arbitrary codes - different codes will lead to preferences of
different hypotheses, and it is not at all clear how a code should be designed that leads
to good inferences with small, practically relevant sample sizes.

Barron and Cover [1991] have developed a precise theory of how to design codes $C_1$
in a 'clever' way, anticipating the developments of 'refined MDL'. Practitioners have
often simply used 'reasonable' coding schemes, based on the following idea. Usually
there exists some 'natural' decomposition of the models under consideration,
$\mathcal{M} = \bigcup_{k > 0} \mathcal{M}^{(k)}$, where the dimension of $\mathcal{M}^{(k)}$ grows with $k$ but is not necessarily equal to
$k$. In the Markov chain example, we have $\mathcal{B} = \bigcup \mathcal{B}^{(k)}$ where $\mathcal{B}^{(k)}$ is the $k$-th order,
$2^k$-parameter Markov model. Then within each submodel $\mathcal{M}^{(k)}$, we may use a
fixed-length code for $\theta \in \Theta^{(k)}$. Since the set $\Theta^{(k)}$ is typically a continuum, we somehow need
to discretize it to achieve this.

Example 2.8 [a Very Crude Code for the Markov Chains] We can describe
a Markov chain of order $k$ by first describing $k$, and then describing a parameter
vector $\theta \in [0, 1]^{k'}$ with $k' = 2^k$. We describe $k$ using our simple code for the integers
(Example 2.4). This takes $2 \log k + 1$ bits. We now have to describe the $k'$-component
parameter vector. We saw in Example 2.7 that for any $x^n$, the best-fitting (ML)
$k$-th order Markov chain can be identified with $k'$ frequencies. It is not hard to see that
these frequencies are uniquely determined by the counts $n_{[1|0 \ldots 00]}, n_{[1|0 \ldots 01]}, \ldots, n_{[1|1 \ldots 11]}$.
Each individual count must be in the $(n + 1)$-element set $\{0, 1, \ldots, n\}$. Since we assume
$n$ is given in advance$^{13}$, we may use a simple fixed-length code to encode this count,
taking $\log(n + 1)$ bits (Example 2.3). Thus, once $k$ is fixed, we can describe such a
Markov chain by a uniform code using $k' \log(n + 1)$ bits. With the code just defined
we get for any $P \in \mathcal{B}$, indexed by parameter $\theta^{(k)}$,

$$L(P) = L(k, \theta^{(k)}) = 2 \log k + 1 + k' \log(n + 1),$$

so that with these codes, MDL tells us to pick the $k$, $\theta^{(k)}$ minimizing

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = 2 \log k + 1 + k' \log(n + 1) - \log P(D \mid k, \theta^{(k)}), \tag{2.8}$$

where the $\theta^{(k)}$ that is chosen will be equal to the ML estimator for $\mathcal{M}^{(k)}$.

Why (not) this code? We may ask two questions about this code. First, why did
we only reserve codewords for $\theta$ that are potentially ML estimators for the given data?
The reason is that, given $k' = 2^k$, the codelength $L(D \mid k, \theta^{(k)})$ is minimized by $\hat{\theta}^{(k)}(D)$,
the ML estimator within $\Theta^{(k)}$. Reserving codewords for $\theta \in [0, 1]^{k'}$ that cannot be ML
estimates would only serve to lengthen $L(D \mid k, \theta^{(k)})$ and can never shorten $L(k, \theta^{(k)})$.
Thus, the total description length needed to encode $D$ will increase. Since our stated
goal is to minimize description lengths, this is undesirable.

However, by the same logic we may also ask whether we have not reserved too many
codewords for $\theta \in [0, 1]^{k'}$. And in fact, it turns out that we have: the distance between
two adjacent ML estimators is $O(1/n)$. Indeed, if we had used a coarser precision, only
reserving codewords for parameters with distances $O(1/\sqrt{n})$, we would obtain smaller
code lengths - (2.8) would become

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = -\log P(D \mid k, \hat{\theta}^{(k)}) + \frac{k'}{2} \log n + c_k, \tag{2.9}$$

where $c_k$ is a small constant depending on $k$, but not $n$ [Barron and Cover 1991]. In
Section 2.6 we show that (2.9) is in some sense 'optimal'.
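To make the crude criterion (2.8) concrete, here is a small Python sketch for selecting a Markov order. It rests on assumptions that are ours rather than the text's: the function names are hypothetical, the first $k$ symbols are treated as a given starting state, logarithms are to base 2, and, since the simple integer code covers $\{1, 2, \ldots\}$, order $k$ is encoded as the integer $k + 1$ so that the Bernoulli model ($k = 0$) can be described as well.

```python
from collections import Counter
from math import log2


def neg_log_lik(x, k):
    """-log2 of the maximized likelihood of x[k:] under a k-th order binary Markov chain."""
    counts = Counter((tuple(x[i - k:i]), x[i]) for i in range(k, len(x)))
    total = 0.0
    for (state, _symbol), c in counts.items():
        n_state = counts[(state, 0)] + counts[(state, 1)]
        total -= c * log2(c / n_state)       # c > 0 for every stored count
    return total


def crude_mdl_length(x, k):
    """Two-part codelength in the spirit of (2.8) for Markov order k."""
    n = len(x)
    l_order = 2 * log2(k + 1) + 1            # assumption: k is encoded as the integer k + 1
    l_params = (2 ** k) * log2(n + 1)        # k' = 2**k counts, log(n + 1) bits each
    return l_order + l_params + neg_log_lik(x, k)


def select_order(x, max_order):
    """Order in 0..max_order with the smallest total two-part codelength."""
    return min(range(max_order + 1), key=lambda k: crude_mdl_length(x, k))


x = [0, 1] * 100                             # hypothetical data: strictly alternating bits
print([round(crude_mdl_length(x, k), 1) for k in range(4)])
print(select_order(x, 5))
```

On this alternating sequence every order $k \geq 1$ fits the data perfectly, so maximum likelihood alone cannot separate them; the parameter cost $k' \log(n + 1)$ makes the criterion prefer order 1.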

The Good News and the Bad News The good news is (1) we have found a
principled, non-arbitrary manner to encode data $D$ given a probability distribution
$H$, namely, to use the code with lengths $-\log P(D \mid H)$; and (2), asymptotically,
any code for hypotheses will lead to a consistent criterion. The bad news is that we
have not found clear guidelines to design codes for hypotheses $H \in \mathcal{M}$. We found
some intuitively reasonable codes for Markov chains, and we then reasoned that these
could be somewhat 'improved', but what is conspicuously lacking is a sound theoretical
principle for designing and improving codes.

We take the good news to mean that our idea may be worth pursuing further. We take
the bad news to mean that we do have to modify or extend the idea to get a meaningful,
non-arbitrary and practically relevant model selection method. Such an extension was
already suggested in Rissanen's early works [Rissanen 1978; Rissanen 1983] and refined
by Barron and Cover [1991]. However, in these works, the principle was still restricted
to two-part codes. To get a fully satisfactory solution, we need to move to 'universal
codes', of which the two-part codes are merely a special case.

2.5 Information Theory II: Universal Codes and Models 

We have just indicated why the two-part code formulation of MDL needs to be refined.
It turns out that the key concept we need is that of universal coding. Broadly speaking,
a code $\bar{L}$ that is universal relative to a set of candidate codes $\mathcal{L}$ allows us to compress
every sequence $x^n$ almost as well as the code in $\mathcal{L}$ that compresses that particular
sequence most. Two-part codes are universal (Section 2.5.1), but there exist other
universal codes such as the Bayesian mixture code (Section 2.5.2) and the Normalized
Maximum Likelihood (NML) code (Section 2.5.3). We also discuss universal models,
which are just the probability distributions corresponding to universal codes. In this
section, we are not concerned with learning from data; we only care about compressing
data as much as possible. We reconnect our findings with learning in Section 2.6.

Coding as Communication Like many other topics in coding, 'universal coding' 
can best be explained if we think of descriptions as messages: we can always view a 
description as a message that some sender or encoder, say Mr. A, sends to some receiver 
or decoder, say Mr. B. Before sending any messages, Mr. A and Mr. B meet in person. 
They agree on the set of messages that A may send to B. Typically, this will be the set
$\mathcal{X}^n$ of sequences $x_1, \ldots, x_n$, where each $x_i$ is an outcome in the space $\mathcal{X}$. They also
agree upon a (prefix) code that will be used by A to send his messages to B. Once this 
has been done, A and B go back to their respective homes and A sends his messages 
to B in the form of binary strings. The unique decodability property of prefix codes 
implies that, when B receives a message, he should always be able to decode it in a 
unique manner. 

Universal Coding Suppose our encoder/sender is about to observe a sequence
$x^n \in \mathcal{X}^n$ which he plans to compress as much as possible. Equivalently, he wants to send
an encoded version of $x^n$ to the receiver using as few bits as possible. Sender and
receiver have a set of candidate codes $\mathcal{L}$ for $\mathcal{X}^n$ available$^{14}$. They believe or hope that
one of these codes will allow for substantial compression of $x^n$. However, they must
decide on a code for $\mathcal{X}^n$ before sender observes the actual $x^n$, and they do not know
which code in $\mathcal{L}$ will lead to good compression of the actual $x^n$. What is the best
thing they can do? They may be tempted to try the following: upon seeing $x^n$, sender
simply encodes/sends $x^n$ using the $L \in \mathcal{L}$ that minimizes $L(x^n)$ among all $L \in \mathcal{L}$.
But this naive scheme will not work: since decoder/receiver does not know what $x^n$
has been sent before decoding the message, he does not know which of the codes in
$\mathcal{L}$ has been used by sender/encoder. Therefore, decoder cannot decode the message:
the resulting protocol does not constitute a uniquely decodable, let alone a prefix code.
Indeed, as we show below, in general no code $\bar{L}$ exists such that for all $x^n \in \mathcal{X}^n$,
$\bar{L}(x^n) \leq \min_{L \in \mathcal{L}} L(x^n)$: in words, there exists no code which, no matter what $x^n$ is,
always mimics the best code for $x^n$.

Example 2.9 Suppose we think that our sequence can be reasonably well-compressed
by a code corresponding to some biased coin model. For simplicity, we restrict ourselves
to a finite number of such models. Thus, let $\mathcal{L} = \{L_1, \ldots, L_9\}$ where $L_1$ is the code
length function corresponding to the Bernoulli model $P(\cdot \mid \theta)$ with parameter $\theta = 0.1$,
$L_2$ corresponds to $\theta = 0.2$ and so on. From (2.7) we see that, for example,

$$L_8(x^n) = -\log P(x^n \mid 0.8) = -n_{[0]} \log 0.2 - n_{[1]} \log 0.8$$
$$L_9(x^n) = -\log P(x^n \mid 0.9) = -n_{[0]} \log 0.1 - n_{[1]} \log 0.9.$$

Both $L_8(x^n)$ and $L_9(x^n)$ are linearly increasing in the number of 1s in $x^n$. However, if
the frequency $n_{[1]}/n$ is approximately 0.8, then $\min_{L \in \mathcal{L}} L(x^n)$ will be achieved for $L_8$. If
$n_{[1]}/n \approx 0.9$ then $\min_{L \in \mathcal{L}} L(x^n)$ is achieved for $L_9$. More generally, if $n_{[1]}/n \approx j/10$ then
$L_j$ achieves the minimum$^{15}$. We would like to send $x^n$ using a code $\bar{L}$ such that for
all $x^n$, we need at most $\hat{L}(x^n)$ bits, where $\hat{L}(x^n)$ is defined as $\hat{L}(x^n) := \min_{L \in \mathcal{L}} L(x^n)$.
Since $-\log$ is monotonically decreasing, $\hat{L}(x^n) = -\log P(x^n \mid \hat{\theta}(x^n))$. We already gave
an informal explanation as to why a code with lengths $\hat{L}$ does not exist. We can now
explain this more formally as follows: if such a code were to exist, it would correspond
to some distribution $\bar{P}$. Then we would have for all $x^n$, $\bar{L}(x^n) = -\log \bar{P}(x^n)$. But,
by definition, for all $x^n \in \mathcal{X}^n$, $\bar{L}(x^n) \leq \hat{L}(x^n) = -\log P(x^n \mid \hat{\theta}(x^n))$ where
$\hat{\theta}(x^n) \in \{0.1, \ldots, 0.9\}$. Thus we get for all $x^n$, $-\log \bar{P}(x^n) \leq -\log P(x^n \mid \hat{\theta}(x^n))$ or
$\bar{P}(x^n) \geq P(x^n \mid \hat{\theta}(x^n))$, so that, since $|\mathcal{L}| > 1$,

$$\sum_{x^n} \bar{P}(x^n) \geq \sum_{x^n} P(x^n \mid \hat{\theta}(x^n)) = \sum_{x^n} \max_{\theta} P(x^n \mid \theta) > 1, \tag{2.10}$$

where the last inequality follows because for any two $\theta_1, \theta_2$ with $\theta_1 \neq \theta_2$, there is at
least one $x^n$ with $P(x^n \mid \theta_1) > P(x^n \mid \theta_2)$. (2.10) says that $\bar{P}$ is not a probability
distribution. It follows that $\bar{L}$ cannot be a codelength function. The argument can
be extended beyond the Bernoulli model of the example above: as long as $|\mathcal{L}| > 1$,
and all codes in $\mathcal{L}$ correspond to a non-defective distribution, (2.10) must still hold,
so that there exists no code $\bar{L}$ with $\bar{L}(x^n) = \hat{L}(x^n)$ for all $x^n$. The underlying reason
that no such code exists is the fact that probabilities must sum up to something $\leq 1$;
or equivalently, that there exists no coding scheme assigning short code words to many
different messages - see Example 2.2.
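The key inequality in this argument, that the maximized probabilities sum to more than 1, can be verified by brute force for small $n$. The Python sketch below is illustrative only; the helper names are ours, and the enumeration over all $2^n$ sequences is feasible only for short sequences.

```python
from itertools import product

THETAS = [j / 10 for j in range(1, 10)]      # the nine Bernoulli parameters 0.1, ..., 0.9


def bernoulli_prob(x, theta):
    """P(x | theta) for a 0/1 sequence x under the i.i.d. Bernoulli(theta) model."""
    n1 = sum(x)
    return theta ** n1 * (1 - theta) ** (len(x) - n1)


def sum_of_maxima(n):
    """Sum over all x in {0,1}^n of max over theta of P(x | theta)."""
    return sum(max(bernoulli_prob(x, t) for t in THETAS)
               for x in product((0, 1), repeat=n))


for n in (1, 2, 5, 10):
    print(n, sum_of_maxima(n))
```

Each single $P(\cdot \mid \theta)$ sums to 1 over all sequences, but taking the maximum separately for each sequence pushes the total above 1, so the lengths $-\log \max_\theta P(x^n \mid \theta)$ violate the Kraft inequality.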

Since there exists no code which, no matter what $x^n$ is, always mimics the best code for
$x^n$, it may make sense to look for the next best thing: does there exist a code which,
for all $x^n \in \mathcal{X}^n$, is 'nearly' (in some sense) as good as $\hat{L}(x^n)$? It turns out that in
many cases, the answer is yes: there typically exist codes $\bar{L}$ such that no matter what
$x^n$ arrives, $\bar{L}(x^n)$ is not much larger than $\hat{L}(x^n)$, which may be viewed as the code
that is best 'with hindsight' (i.e., after seeing $x^n$). Intuitively, codes which satisfy this
property are called universal codes - a more precise definition follows below. The first
(but perhaps not foremost) example of a universal code is the two-part code that we
have encountered in Section 2.4.

2.5.1 Two-part Codes as simple Universal Codes 

Example 2.10 [finite $\mathcal{L}$] Let $\mathcal{L}$ be as in Example 2.9. We can devise a code $\bar{L}_{2\text{-p}}$
for all $x^n \in \mathcal{X}^n$ as follows: to encode $x^n$, we first encode the $j \in \{1, \ldots, 9\}$ such that
$L_j(x^n) = \min_{L \in \mathcal{L}} L(x^n)$, using a uniform code. This takes $\log 9$ bits. We then encode
$x^n$ itself using the code indexed by $j$. This takes $L_j(x^n)$ bits. Note that in contrast to the
naive scheme discussed in Example 2.9, the resulting scheme properly defines a prefix
code: a decoder can decode $x^n$ by first decoding $j$, and then decoding $x^n$ using $L_j$.
Thus, for every possible $x^n \in \mathcal{X}^n$, we obtain

$$\bar{L}_{2\text{-p}}(x^n) = \min_{L \in \mathcal{L}} L(x^n) + \log 9.$$

For all $L \in \mathcal{L}$, $\min_{x^n} L(x^n)$ grows linearly in $n$: $\min_{\theta, x^n} -\log P(x^n \mid \theta) = -n \log 0.9 \approx 0.15n$.
Unless $n$ is very small, no matter what $x^n$ arises, the extra number of bits we
need using $\bar{L}_{2\text{-p}}$ compared to $\hat{L}(x^n)$ is negligible.
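
The construction of Example 2.10 can be checked numerically. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes binary data, the nine Bernoulli parameters 0.1, ..., 0.9, and codelengths measured in bits (base-2 logarithms).

```python
import math

# The nine candidate Bernoulli parameters of Example 2.10 (theta = 0.1, ..., 0.9).
THETAS = [j / 10 for j in range(1, 10)]

def codelength(x, theta):
    """-log2 P(x | theta): bits needed to encode binary sequence x with Bernoulli(theta)."""
    n1 = sum(x)
    n0 = len(x) - n1
    return -(n1 * math.log2(theta) + n0 * math.log2(1 - theta))

def best_in_hindsight(x):
    """The minimum codelength over the nine candidate codes, chosen after seeing x."""
    return min(codelength(x, t) for t in THETAS)

def two_part(x):
    """Two-part codelength: log2(9) bits for the index j, then L_j(x) bits for the data."""
    return math.log2(len(THETAS)) + best_in_hindsight(x)

if __name__ == "__main__":
    x = [1] * 80 + [0] * 20
    print(best_in_hindsight(x), two_part(x))
```

Whatever sequence is supplied, the overhead of the two-part code over the hindsight-optimal code is the constant $\log 9 \approx 3.17$ bits, while the hindsight-optimal codelength itself is at least about $0.15n$ bits.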

More generally, let $\mathcal{L} = \{L_1, \ldots, L_M\}$ where $M$ can be arbitrarily large, and the $L_j$
can be any codelength functions we like; they do not necessarily represent Bernoulli
distributions any more. By the reasoning of Example 2.10, there exists a (two-part)
code such that for all $x^n \in \mathcal{X}^n$,

$$\bar{L}_{2\text{-p}}(x^n) = \min_{L \in \mathcal{L}} L(x^n) + \log M. \tag{2.11}$$

In most applications $\min_{L \in \mathcal{L}} L(x^n)$ grows linearly in $n$, and we see from (2.11) that, as soon
as $n$ becomes substantially larger than $\log M$, the relative difference in performance
between our universal code and $\hat{L}(x^n)$ becomes negligible. In general, we do not always
want to use a uniform code for the elements in $\mathcal{L}$; note that any arbitrary code on $\mathcal{L}$
will give us an analogue of (2.11), but with a worst-case overhead larger than $\log M$ -
corresponding to the largest codelength of any of the elements in $\mathcal{L}$.

Example 2.11 [Countably Infinite $\mathcal{L}$] We can also construct a 2-part code for
arbitrary countably infinite sets of codes $\mathcal{L} = \{L_1, L_2, \ldots\}$: we first encode some $k$ using
our simple code for the integers (Example 2.4). With this code we need $2 \log k + 1$ bits
to encode integer $k$. We then encode $x^n$ using the code $L_k$. $\bar{L}_{2\text{-p}}$ is now defined as the
code we get if, for any $x^n$, we encode $x^n$ using the $L_k$ minimizing the total two-part
description length $2 \log k + 1 + L_k(x^n)$.

In contrast to the case of finite $\mathcal{L}$, there does not exist a constant $c$ any more such
that for all $n$, $x^n \in \mathcal{X}^n$, $\bar{L}_{2\text{-p}}(x^n) \le \inf_{L \in \mathcal{L}} L(x^n) + c$. Instead we have the following
weaker, but still remarkable property: for all $k$, all $n$, all $x^n$, $\bar{L}_{2\text{-p}}(x^n) \le L_k(x^n) +
2 \log k + 1$, so that also,

$$\bar{L}_{2\text{-p}}(x^n) \le \inf_{L \in \{L_1, \ldots, L_k\}} L(x^n) + 2 \log k + 1.$$

For any $k$, as $n$ grows larger, the code $\bar{L}_{2\text{-p}}$ starts to mimic whatever $L \in \{L_1, \ldots, L_k\}$
compresses the data most. However, the larger $k$, the larger $n$ has to be before this
happens.

2.5.2 From Universal Codes to Universal Models 

Instead of postulating a set of candidate codes $\mathcal{L}$, we may equivalently postulate a set
$\mathcal{M}$ of candidate probabilistic sources, such that $\mathcal{L}$ is the set of codes corresponding to
$\mathcal{M}$. We already implicitly did this in Example 2.
The reasoning is now as follows: we think that one of the $P \in \mathcal{M}$ will assign a high
likelihood to the data to be observed. Therefore we would like to design a code that,
for all $x^n$ we might observe, performs essentially as well as the code corresponding to
the best-fitting, maximum likelihood (minimum codelength) $P \in \mathcal{M}$ for $x^n$. Similarly,
we can think of universal codes such as the two-part code in terms of the (possibly
defective, see Section 2.2 and Figure 2.1) distributions corresponding to them. Such
distributions corresponding to universal codes are called universal models. The use of
mapping universal codes back to distributions is illustrated by the Bayesian universal
model which we now introduce.

Universal model: Twice Misleading Terminology The words 'universal' and
'model' are somewhat of a misnomer: first, these codes/models are only 'universal'
relative to a restricted 'universe' $\mathcal{M}$. Second, the use of the word 'model' will be
very confusing to statisticians, who (as we also do in this paper) call a family of
distributions such as $\mathcal{M}$ a 'model'. But the phrase originates from information
theory, where a 'model' often refers to a single distribution rather than a family.
Thus, a 'universal model' is a single distribution, representing a statistical 'model'
$\mathcal{M}$.

Example 2.12 [Bayesian Universal Model] Let $\mathcal{M}$ be a finite or countable set of
probabilistic sources, parameterized by some parameter set $\Theta$. Let $W$ be a distribution
on $\Theta$. Adopting terminology from Bayesian statistics, $W$ is usually called a prior
distribution. We can construct a new probabilistic source $\bar{P}_{\text{Bayes}}$ by taking a weighted
(according to $W$) average or mixture over the distributions in $\mathcal{M}$. That is, we define
for all $n$, $x^n \in \mathcal{X}^n$,

$$\bar{P}_{\text{Bayes}}(x^n) := \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta). \tag{2.12}$$

It is easy to check that $\bar{P}_{\text{Bayes}}$ is a probabilistic source according to our definition.
In case $\Theta$ is continuous, the sum gets replaced by an integral but otherwise nothing
changes in the definition. In Bayesian statistics, $\bar{P}_{\text{Bayes}}$ is called the Bayesian marginal
likelihood or Bayesian mixture [Bernardo and Smith 1994]. To see that $\bar{P}_{\text{Bayes}}$ is a
universal model, note that for all $\theta_0 \in \Theta$,

$$-\log \bar{P}_{\text{Bayes}}(x^n) = -\log \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta) \le -\log P(x^n \mid \theta_0) + c_{\theta_0}, \tag{2.13}$$

where the inequality follows because a sum is at least as large as each of its terms,
and $c_{\theta} = -\log W(\theta)$ depends on $\theta$ but not on $n$. Thus, $\bar{P}_{\text{Bayes}}$ is a universal model
or equivalently, the code with lengths $-\log \bar{P}_{\text{Bayes}}$ is a universal code. Note that the
derivation in (2.13) only works if $\Theta$ is finite or countable; the case of continuous $\Theta$ is
treated in Section 2.6.
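
The bound (2.13) is easy to verify numerically for a small finite model. The sketch below is an illustration under stated assumptions: binary data, the nine Bernoulli parameters 0.1, ..., 0.9, and an arbitrarily chosen non-uniform prior $W(\theta_j) \propto 2^{-j}$; none of these specific choices, nor the function names, come from the text.

```python
import math

# Assumed finite parameter set and a non-uniform prior W(theta_j) proportional to 2^-j.
THETAS = [j / 10 for j in range(1, 10)]
_raw = [2.0 ** -(j + 1) for j in range(len(THETAS))]
PRIOR = {t: w / sum(_raw) for t, w in zip(THETAS, _raw)}

def likelihood(x, theta):
    """P(x | theta) for a binary sequence x under an i.i.d. Bernoulli(theta) source."""
    n1 = sum(x)
    return theta ** n1 * (1 - theta) ** (len(x) - n1)

def bayes_codelength(x):
    """-log2 of the Bayesian mixture (2.12)."""
    mixture = sum(likelihood(x, t) * PRIOR[t] for t in THETAS)
    return -math.log2(mixture)

def bound(x, theta0):
    """Right-hand side of (2.13): -log2 P(x | theta0) + c_theta0, with c = -log2 W(theta0)."""
    return -math.log2(likelihood(x, theta0)) - math.log2(PRIOR[theta0])

if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 5
    for t in THETAS:
        print(t, round(bayes_codelength(x), 3), "<=", round(bound(x, t), 3))
```

The overhead $c_{\theta_0}$ differs per $\theta_0$ (here from about 1 to about 9 bits) but does not depend on the length of `x`.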

Bayes is Better than Two-part The Bayesian model is in a sense superior to the
two-part code. Namely, in the two-part code we first encode an element of $\mathcal{M}$ or its
parameter set using some code $L_0$. Such a code must correspond to some 'prior'
distribution $W$ on $\mathcal{M}$ so that the two-part code gives codelengths

$$\bar{L}_{2\text{-p}}(x^n) = \min_{\theta \in \Theta} \; -\log P(x^n \mid \theta) - \log W(\theta), \tag{2.14}$$

where $W$ depends on the specific code $L_0$ that was used. Using the Bayes code with
prior $W$, we get as in (2.13),

$$-\log \bar{P}_{\text{Bayes}}(x^n) = -\log \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta) \le \min_{\theta \in \Theta} \; -\log P(x^n \mid \theta) - \log W(\theta).$$

The inequality becomes strict whenever $P(x^n \mid \theta) > 0$ for more than one value of $\theta$.
Comparing to (2.14), we see that in general the Bayesian code is preferable over the
two-part code: for all $x^n$ it never assigns codelengths larger than $\bar{L}_{2\text{-p}}(x^n)$, and in many
cases it assigns strictly shorter codelengths for some $x^n$. But this raises two important
issues: (1) what exactly do we mean by 'better' anyway? (2) can we say that 'some
prior distributions are better than others'? These questions are answered below.
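
To make the comparison with (2.14) concrete, the following Python sketch computes both codelengths for the same prior. It is only an illustration: it assumes binary data, the nine Bernoulli parameters 0.1, ..., 0.9 and a uniform prior $W$, and the function names are hypothetical.

```python
import math

THETAS = [j / 10 for j in range(1, 10)]   # assumed finite parameter set
W = {t: 1 / len(THETAS) for t in THETAS}  # assumed uniform prior

def lik(x, theta):
    """P(x | theta) for a binary sequence under i.i.d. Bernoulli(theta)."""
    n1 = sum(x)
    return theta ** n1 * (1 - theta) ** (len(x) - n1)

def two_part_length(x):
    """(2.14): minimize -log2 P(x | theta) - log2 W(theta) over theta."""
    return min(-math.log2(lik(x, t)) - math.log2(W[t]) for t in THETAS)

def bayes_length(x):
    """-log2 of the Bayesian mixture with the same prior W."""
    return -math.log2(sum(lik(x, t) * W[t] for t in THETAS))

if __name__ == "__main__":
    x = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]    # seven ones, three zeros
    print(two_part_length(x), bayes_length(x))
```

The two-part code keeps only the largest term of the sum inside the logarithm, whereas the mixture keeps all of them; since every $\theta$ here gives the data positive probability, the Bayes codelength is strictly shorter.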

2.5.3 NML as an Optimal Universal Model 

We can measure the performance of universal models relative to a set of candidate
sources $\mathcal{M}$ using the regret:

Definition 2.13 [Regret] Let $\mathcal{M}$ be a class of probabilistic sources. Let $\bar{P}$ be a
probability distribution on $\mathcal{X}^n$ ($\bar{P}$ is not necessarily in $\mathcal{M}$). For given $x^n$, the regret of $\bar{P}$
relative to $\mathcal{M}$ is defined as

$$-\log \bar{P}(x^n) - \min_{P \in \mathcal{M}} \{-\log P(x^n)\}. \tag{2.15}$$

The regret of $\bar{P}$ relative to $\mathcal{M}$ for $x^n$ is the additional number of bits needed to encode
$x^n$ using the code/distribution $\bar{P}$, as compared to the number of bits that would have been
needed had we used the code/distribution in $\mathcal{M}$ that was optimal ('best-fitting') with
hindsight. For simplicity, from now on we tacitly assume that for all the models $\mathcal{M}$
we work with, there is a single $\hat{\theta}(x^n)$ maximizing the likelihood for every $x^n \in \mathcal{X}^n$. In
that case (2.15) simplifies to

$$-\log \bar{P}(x^n) - \{-\log P(x^n \mid \hat{\theta}(x^n))\}.$$

We would like to measure the quality of a universal model $\bar{P}$ in terms of its regret.
However, $\bar{P}$ may have small (even $< 0$) regret for some $x^n$, and very large regret for
other $x^n$. We must somehow find a measure of quality that takes into account all
$x^n \in \mathcal{X}^n$. We take a worst-case approach, and look for universal models $\bar{P}$ with small
worst-case regret, where the worst case is over all sequences. Formally, the maximum
or worst-case regret of $\bar{P}$ relative to $\mathcal{M}$ is defined as

$$\mathcal{R}_{\max}(\bar{P}) := \max_{x^n \in \mathcal{X}^n} \left\{ -\log \bar{P}(x^n) - \{-\log P(x^n \mid \hat{\theta}(x^n))\} \right\}.$$
If we use $\mathcal{R}_{\max}$ as our quality measure, then the 'optimal' universal model relative to
$\mathcal{M}$, for given sample size $n$, is the distribution $\bar{P}$ achieving

$$\min_{\bar{P}} \mathcal{R}_{\max}(\bar{P}) = \min_{\bar{P}} \max_{x^n \in \mathcal{X}^n} \left\{ -\log \bar{P}(x^n) - \{-\log P(x^n \mid \hat{\theta}(x^n))\} \right\}, \tag{2.16}$$

where the minimum is over all defective distributions on $\mathcal{X}^n$. The $\bar{P}$ minimizing (2.16)
corresponds to the code minimizing the additional number of bits compared to the code in
$\mathcal{M}$ that is best in hindsight, in the worst case over all possible $x^n$. It turns out that we
can solve for $\bar{P}$ in (2.16). To this end, we first define the complexity of a given model
$\mathcal{M}$ as

$$\mathbf{COMP}_n(\mathcal{M}) := \log \sum_{x^n \in \mathcal{X}^n} P(x^n \mid \hat{\theta}(x^n)). \tag{2.17}$$

This quantity plays a fundamental role in refined MDL, Section 2.6. To get a first idea
of why $\mathbf{COMP}_n$ is called model complexity, note that the more sequences $x^n$ with
large $P(x^n \mid \hat{\theta}(x^n))$, the larger $\mathbf{COMP}_n(\mathcal{M})$. In other words, the more sequences that
can be fit well by an element of $\mathcal{M}$, the larger $\mathcal{M}$'s complexity.

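Proposition 2.14 [Shtarkov 1987] Suppose that $\mathbf{COMP}_n(\mathcal{M})$ is finite. Then the
minimax regret (2.16) is uniquely achieved for the distribution $\bar{P}_{\text{nml}}$ given by

$$\bar{P}_{\text{nml}}(x^n) := \frac{P(x^n \mid \hat{\theta}(x^n))}{\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat{\theta}(y^n))}. \tag{2.18}$$

The distribution $\bar{P}_{\text{nml}}$ is known as the Shtarkov distribution or the normalized maximum
likelihood (NML) distribution.

Proof Plug in $\bar{P}_{\text{nml}}$ in (2.16) and notice that for all $x^n \in \mathcal{X}^n$,

$$-\log \bar{P}_{\text{nml}}(x^n) - \{-\log P(x^n \mid \hat{\theta}(x^n))\} = \mathcal{R}_{\max}(\bar{P}_{\text{nml}}) = \mathbf{COMP}_n(\mathcal{M}), \tag{2.19}$$

so that $\bar{P}_{\text{nml}}$ achieves the same regret, equal to $\mathbf{COMP}_n(\mathcal{M})$, no matter what $x^n$
actually obtains. Since every distribution $P$ on $\mathcal{X}^n$ with $P \ne \bar{P}_{\text{nml}}$ must satisfy $P(z^n) <
\bar{P}_{\text{nml}}(z^n)$ for at least one $z^n \in \mathcal{X}^n$, it follows that

$$\mathcal{R}_{\max}(P) \ge -\log P(z^n) + \log P(z^n \mid \hat{\theta}(z^n)) > -\log \bar{P}_{\text{nml}}(z^n) + \log P(z^n \mid \hat{\theta}(z^n)) = \mathcal{R}_{\max}(\bar{P}_{\text{nml}}).$$

$\Box$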

$\bar{P}_{\text{nml}}$ is quite literally a 'normalized maximum likelihood' distribution: it tries to assign
to each $x^n$ the probability of $x^n$ according to the ML distribution for $x^n$. By (2.10),
this is not possible: the resulting 'probabilities' add to something larger than 1. But
we can normalize these 'probabilities' by dividing by their sum $\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat{\theta}(y^n))$,
and then we obtain a probability distribution on $\mathcal{X}^n$ after all.
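
For very small sample sizes the NML distribution can be computed by brute force, which makes the constant-regret property (2.19) visible. The Python sketch below is an illustration only: it assumes the full Bernoulli model on binary sequences (with ML estimate equal to the relative frequency of ones), enumerates all $2^n$ sequences, and uses hypothetical function names.

```python
import math
from itertools import product

def ml_prob(x):
    """Maximized likelihood P(x | theta_hat(x)) under the Bernoulli model."""
    n, n1 = len(x), sum(x)
    th = n1 / n
    # Python evaluates 0.0 ** 0 as 1.0, which matches the convention 0^0 = 1 needed here.
    return th ** n1 * (1 - th) ** (n - n1)

def nml(n):
    """Return the NML distribution on {0,1}^n as a dict, and COMP_n in bits."""
    seqs = list(product([0, 1], repeat=n))
    norm = sum(ml_prob(x) for x in seqs)       # the sum in the denominator of (2.18)
    dist = {x: ml_prob(x) / norm for x in seqs}
    return dist, math.log2(norm)

if __name__ == "__main__":
    dist, comp = nml(8)
    regrets = {round(-math.log2(p) + math.log2(ml_prob(x)), 9) for x, p in dist.items()}
    print(comp, regrets)   # a single regret value, equal to COMP_n
```

The unnormalized maximum likelihoods sum to more than 1 (2.5 for $n = 2$), and after normalization every sequence has exactly the same regret, namely $\mathbf{COMP}_n$.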
Whenever $\mathcal{X}$ is finite, the sum $\mathbf{COMP}_n(\mathcal{M})$ is finite so that the NML distribution
is well-defined. If $\mathcal{X}$ is countably infinite or continuous-valued, the sum $\mathbf{COMP}_n(\mathcal{M})$
may be infinite and then the NML distribution may be undefined. In that case, there
exists no universal model achieving constant regret as in (2.19). If $\mathcal{M}$ is parametric,
then $\bar{P}_{\text{nml}}$ is typically well-defined as long as we suitably restrict the parameter space.
The parametric case forms the basis of 'refined MDL' and will be discussed at length
in the next section.



Summary: Universal Codes and Models

Let $\mathcal{M}$ be a family of probabilistic sources. A universal model in an individual
sequence sense$^{16}$ relative to $\mathcal{M}$, in this text simply called a 'universal model for
$\mathcal{M}$', is a sequence of distributions $\bar{P}^{(1)}, \bar{P}^{(2)}, \ldots$ on $\mathcal{X}^1, \mathcal{X}^2, \ldots$ respectively, such
that for all $P \in \mathcal{M}$, for all $\epsilon > 0$,

$$\max_{x^n \in \mathcal{X}^n} \frac{1}{n} \left\{ -\log \bar{P}^{(n)}(x^n) - [-\log P(x^n)] \right\} \le \epsilon \quad \text{as } n \to \infty.$$

Multiplying both sides with $n$ we see that $\bar{P}$ is universal if for every $P \in \mathcal{M}$,
the codelength difference $-\log \bar{P}(x^n) + \log P(x^n)$ increases sublinearly in $n$. If $\mathcal{M}$
is finite, then the two-part, Bayes and NML distributions are universal in a very
strong sense: rather than just increasing sublinearly, the codelength difference is
bounded by a constant.

We already discussed two-part, Bayesian and minimax optimal (NML) universal
models, but there are several other types. We mention prequential universal models
(Section 2.6.4), the Kolmogorov universal model, conditionalized two-part codes
[Rissanen 2001] and Cesàro-average codes [Barron, Rissanen, and Yu 1998].



2.6 Simple Refined MDL and its Four Interpretations 

In Section 2.4 we indicated that 'crude' MDL needs to be refined. In Section 2.5
we introduced universal models. We now show how universal models, in particular
the minimax optimal universal model $\bar{P}_{\text{nml}}$, can be used to define a refined version of
MDL model selection. Here we only discuss the simplest case: suppose we are given
data $D = (x_1, \ldots, x_n)$ and two models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ such that $\mathbf{COMP}_n(\mathcal{M}^{(1)})$
and $\mathbf{COMP}_n(\mathcal{M}^{(2)})$ (Equation 2.17) are both finite. For example, we could have
some binary data, with $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ the first- and second-order Markov models
(Example 2.7), both considered possible explanations for the data. We show how to deal
with an infinite number of models and/or models with infinite $\mathbf{COMP}_n$ in Section 2.7.
Denote by $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}^{(j)})$ the NML distribution on $\mathcal{X}^n$ corresponding to model $\mathcal{M}^{(j)}$.
Refined MDL tells us to pick the model $\mathcal{M}^{(j)}$ maximizing the normalized maximum
likelihood $\bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(j)})$, or, by (2.18), equivalently, minimizing

$$-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(j)}) = -\log P(D \mid \hat{\theta}^{(j)}(D)) + \mathbf{COMP}_n(\mathcal{M}^{(j)}). \tag{2.20}$$

From a coding theoretic point of view, we associate with each $\mathcal{M}^{(j)}$ a code with lengths
$-\log \bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}^{(j)})$, and we pick the model minimizing the codelength of the data. The
codelength $-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(j)})$ has been called the stochastic complexity of the data
$D$ relative to model $\mathcal{M}^{(j)}$ [Rissanen 1987], whereas $\mathbf{COMP}_n(\mathcal{M}^{(j)})$ is called the
parametric complexity or model cost of $\mathcal{M}^{(j)}$ (in this survey we simply call it 'complexity').
We have already indicated in the previous section that $\mathbf{COMP}_n(\mathcal{M}^{(j)})$ measures
something like the 'complexity' of model $\mathcal{M}^{(j)}$. On the other hand, $-\log P(D \mid \hat{\theta}^{(j)}(D))$ is
minus the maximized log-likelihood of the data, so it measures something like (minus)
fit or error - in the linear regression case, it can be directly related to the mean squared
error, Section 2.8. Thus, (2.20) embodies a trade-off between lack of fit (measured by
minus log-likelihood) and complexity (measured by $\mathbf{COMP}_n(\mathcal{M}^{(j)})$). The confidence
in the decision is given by the codelength difference

$$\left| -\log \bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(1)}) - [-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(2)})] \right|.$$

In general, $-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M})$ can only be evaluated numerically - the only exception
this author is aware of is when $\mathcal{M}$ is the Gaussian family, Example 2.20. In many cases
even numerical evaluation is computationally problematic. But the re-interpretations
of $\bar{P}_{\text{nml}}$ we provide below also indicate that in many cases, $-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M})$ is relatively
easy to approximate.
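
As a toy instance of the selection rule (2.20), the Python sketch below compares two models for binary data. The choice of models is an assumption made purely for illustration and differs from the Markov example in the text: a singleton model containing only the fair coin (whose complexity is $\log 1 = 0$) and the full Bernoulli model, whose $\mathbf{COMP}_n$ is computed exactly by summing over the number of ones. Function names are hypothetical.

```python
import math

def neg_log_ml_bernoulli(x):
    """-log2 P(x | theta_hat(x)) for the Bernoulli model, with theta_hat = fraction of ones."""
    n, n1 = len(x), sum(x)
    bits = 0.0
    for count in (n1, n - n1):
        if count:                      # a zero count contributes nothing (0 log 0 = 0)
            bits -= count * math.log2(count / n)
    return bits

def comp_bernoulli(n):
    """Exact COMP_n (2.17) in bits; sequences with the same number of ones are grouped."""
    total = sum(math.comb(n, k) * (k / n) ** k * (1 - k / n) ** (n - k)
                for k in range(n + 1))
    return math.log2(total)

def stochastic_complexity(x, model):
    """-log2 P_nml(x | model) as in (2.20)."""
    if model == "fair":                # singleton model {theta = 1/2}: n bits, COMP_n = 0
        return float(len(x))
    return neg_log_ml_bernoulli(x) + comp_bernoulli(len(x))

def select(x):
    """Pick the model with the smaller stochastic complexity."""
    return min(("fair", "bernoulli"), key=lambda m: stochastic_complexity(x, m))

if __name__ == "__main__":
    print(select([0, 1] * 10), select([1] * 20))
```

For a balanced sequence the Bernoulli model fits no better than the fair coin, so its extra complexity term decides against it; for a heavily skewed sequence the gain in fit far outweighs that term.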

Example 2.15 [Refined MDL and GLRT] Generalized likelihood ratio testing
[Casella and Berger 1990] tells us to pick the $\mathcal{M}^{(j)}$ maximizing $\log P(D \mid \hat{\theta}^{(j)}(D)) +
c$ where $c$ is determined by the desired type-I and type-II errors. In practice one
often applies a naive variation$^{17}$, simply picking the model $\mathcal{M}^{(j)}$ maximizing $\log P(D \mid
\hat{\theta}^{(j)}(D))$. This amounts to ignoring the complexity terms $\mathbf{COMP}_n(\mathcal{M}^{(j)})$ in (2.20):
MDL tries to avoid overfitting by picking the model maximizing the normalized rather
than the ordinary likelihood. The more distributions in $\mathcal{M}$ that fit the data well, the
larger the normalization term.

The hope is that the normalization term $\mathbf{COMP}_n(\mathcal{M}^{(j)})$ strikes the right balance
between complexity and fit. Whether it really does depends on whether $\mathbf{COMP}_n$ is
a 'good' measure of complexity. In the remainder of this section we shall argue that
it is, by giving four different interpretations of $\mathbf{COMP}_n$ and of the resulting trade-off
(2.20):

1. Compression interpretation. 

2. Counting interpretation. 

3. Bayesian interpretation. 

4. Prequential (predictive) interpretation. 
2.6.1 Compression Interpretation

Rissanen's original goal was to select the model that detects the most regularity in
the data; he identified this with the 'model that allows for the most compression of
data $x^n$'. To make this precise, a code is associated with each model. The NML code
with lengths $-\log \bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}^{(j)})$ seems to be a very reasonable choice for such a code
because of the following two properties:

1. The better the best-fitting distribution in $\mathcal{M}^{(j)}$ fits the data, the shorter the
codelength $-\log \bar{P}_{\text{nml}}(D \mid \mathcal{M}^{(j)})$.

2. No distribution in $\mathcal{M}^{(j)}$ is given a prior preference over any other distribution,
since the regret of $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}^{(j)})$ is the same for all $D \in \mathcal{X}^n$ (Equation (2.19)).
$\bar{P}_{\text{nml}}$ is the only complete prefix code with this property, which may be restated
as: $\bar{P}_{\text{nml}}$ treats all distributions within each $\mathcal{M}^{(j)}$ on the same footing!

Therefore, if one is willing to accept the basic ideas underlying MDL as first principles,
then the use of NML in model selection is now justified to some extent. Below we give
additional justifications that are not directly based on data compression; but we first
provide some further interpretation of $-\log \bar{P}_{\text{nml}}$.

Compression and Separating Structure from Noise We present the following
ideas in an imprecise fashion - Rissanen and Tabus [2004] recently showed how to make
them precise. The stochastic complexity of data $D$ relative to $\mathcal{M}$, given by (2.20), can
be interpreted as the amount of information in the data relative to $\mathcal{M}$, measured in bits.
Although a one-part codelength, it still consists of two terms: a term $\mathbf{COMP}_n(\mathcal{M})$
measuring the amount of structure or meaningful information in the data (as 'seen
through $\mathcal{M}$'), and a term $-\log P(D \mid \hat{\theta}(D))$ measuring the amount of noise or
accidental information in the data. To see that this second term measures noise, consider
the regression example, Example 1.2, again. As will be seen in Section 2.8, Equation
(2.40), in that case $-\log P(D \mid \hat{\theta}(D))$ becomes equal to a linear function of the mean
squared error of the best-fitting polynomial in the set of $k$-th degree polynomials. To
see that the first term measures structure, we reinterpret it below as the number of
bits needed to specify a 'distinguishable' distribution in $\mathcal{M}$, using a uniform code on
all 'distinguishable' distributions.

2.6.2 Counting Interpretation 

The parametric complexity can be interpreted as measuring (the log of) the number of
distinguishable distributions in the model. Intuitively, the more distributions a model
contains, the more patterns it can fit well so the larger the risk of overfitting. However,
if two distributions are very 'close' in the sense that they assign high likelihood to the
same patterns, they do not contribute so much to the complexity of the overall model.
It seems that we should measure complexity of a model in terms of the number of
distributions it contains that are 'essentially different' (distinguishable), and we now
show that $\mathbf{COMP}_n$ measures something like this. Consider a finite model $\mathcal{M}$ with
parameter set $\Theta = \{\theta_1, \ldots, \theta_M\}$. Note that

$$\sum_{x^n \in \mathcal{X}^n} P(x^n \mid \hat{\theta}(x^n)) = \sum_{j=1..M} \; \sum_{x^n : \hat{\theta}(x^n) = \theta_j} P(x^n \mid \theta_j) = \sum_{j=1..M} \Big( 1 - \sum_{x^n : \hat{\theta}(x^n) \ne \theta_j} P(x^n \mid \theta_j) \Big) = M - \sum_{j} P(\hat{\theta}(x^n) \ne \theta_j \mid \theta_j).$$

We may think of $P(\hat{\theta}(x^n) \ne \theta_j \mid \theta_j)$ as the probability, according to $\theta_j$, that the data
look as if they come from some $\theta \ne \theta_j$. Thus, it is the probability that $\theta_j$ is
mistaken for another distribution in $\Theta$. Therefore, for finite $\mathcal{M}$, Rissanen's model
complexity is the logarithm of the number of distributions minus the summed probability
that some $\theta_j$ is 'mistaken' for some $\theta \ne \theta_j$. Now suppose $\mathcal{M}$ is i.i.d. By the law of
large numbers [Feller 1968], we immediately see that the 'sum of mistake probabilities'
$\sum_j P(\hat{\theta}(x^n) \ne \theta_j \mid \theta_j)$ tends to 0 as $n$ grows. It follows that for large $n$, the model
complexity converges to $\log M$. For large $n$, the distributions in $\mathcal{M}$ are 'perfectly
distinguishable' (the probability that a sample coming from one is more representative
of another is negligible), and then the parametric complexity $\mathbf{COMP}_n(\mathcal{M})$ of $\mathcal{M}$ is
simply the log of the number of distributions in $\mathcal{M}$.

Example 2.16 [NML vs. Two-part Codes] Incidentally, this shows that for finite
i.i.d. $\mathcal{M}$, the two-part code with uniform prior $W$ on $\mathcal{M}$ is asymptotically minimax
optimal: for all $n$, the regret of the 2-part code is $\log M$ (Equation 2.11), whereas we
just showed that for $n \to \infty$, $\mathcal{R}(\bar{P}_{\text{nml}}) = \mathbf{COMP}_n(\mathcal{M}) \to \log M$. However, for small $n$,
some distributions in $\mathcal{M}$ may be mistaken for one another; the number of distinguishable
distributions in $\mathcal{M}$ is then smaller than the actual number of distributions, and this is
reflected in $\mathbf{COMP}_n(\mathcal{M})$ being (sometimes much) smaller than $\log M$.
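
The convergence of $\mathbf{COMP}_n(\mathcal{M})$ to $\log M$ can be observed directly for a tiny finite model. The Python sketch below is only an illustration: it assumes an i.i.d. Bernoulli model with the two parameters 0.2 and 0.8 (so $\log M = 1$ bit), and the function name is hypothetical.

```python
import math

def comp_finite(n, thetas):
    """COMP_n in bits for a finite i.i.d. Bernoulli model with the given parameters.

    Sequences with the same number of ones k share the same maximized likelihood,
    so the sum over all 2^n sequences is grouped by k.
    """
    total = 0.0
    for k in range(n + 1):
        best = max(t ** k * (1 - t) ** (n - k) for t in thetas)
        total += math.comb(n, k) * best
    return math.log2(total)

if __name__ == "__main__":
    for n in (1, 5, 50, 200):
        print(n, comp_finite(n, [0.2, 0.8]))
```

For $n = 1$ the two distributions are easily mistaken for one another and the complexity is only $\log 1.6 \approx 0.68$ bits; as $n$ grows the mistake probabilities vanish and the value approaches 1 bit.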

For the more interesting case of parametric models, containing infinitely many
distributions, Balasubramanian [1997, 2004] has a somewhat different counting interpretation
of $\mathbf{COMP}_n(\mathcal{M})$ as a ratio between two volumes. Rissanen and Tabus [2004] give a
more direct counting interpretation of $\mathbf{COMP}_n(\mathcal{M})$. These extensions are both based
on the asymptotic expansion of $\bar{P}_{\text{nml}}$, which we now discuss.

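Asymptotic Expansion of $\bar{P}_{\text{nml}}$ and $\mathbf{COMP}_n$ Let $\mathcal{M}$ be a $k$-dimensional
parametric model. Under regularity conditions on $\mathcal{M}$ and the parameterization $\Theta \to
\mathcal{M}$, to be detailed below, we obtain the following asymptotic expansion of $\mathbf{COMP}_n$
[Rissanen 1996; Takeuchi and Barron 1997; Takeuchi and Barron 1998; Takeuchi 2000]:

$$\mathbf{COMP}_n(\mathcal{M}) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\theta \in \Theta} \sqrt{|I(\theta)|} \, d\theta + o(1). \tag{2.21}$$

Here $k$ is the number of parameters (degrees of freedom) in model $\mathcal{M}$, $n$ is the sample
size, and $o(1) \to 0$ as $n \to \infty$. $|I(\theta)|$ is the determinant of the $k \times k$ Fisher information
matrix$^{18}$ $I$ evaluated at $\theta$. In case $\mathcal{M}$ is an i.i.d. model, $I$ is given by

$$I_{ij}(\theta^*) := E_{\theta^*} \left\{ -\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log P(X \mid \theta) \right\}_{\theta = \theta^*}.$$

This is generalized to non-i.i.d. models as follows:

$$I_{ij}(\theta^*) := \lim_{n \to \infty} \frac{1}{n} E_{\theta^*} \left\{ -\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log P(X^n \mid \theta) \right\}_{\theta = \theta^*}.$$

(2.21) only holds if the model $\mathcal{M}$, its parameterization $\Theta$ and the sequence $x_1, x_2, \ldots$
all satisfy certain conditions. Specifically, we require: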



1. $\mathbf{COMP}_n(\mathcal{M}) < \infty$ and $\int \sqrt{|I(\theta)|} \, d\theta < \infty$;

2. $\hat{\theta}(x^n)$ does not come arbitrarily close to the boundary of $\Theta$: for some $\epsilon > 0$, for
all large $n$, $\hat{\theta}(x^n)$ remains farther than $\epsilon$ from the boundary of $\Theta$.

3. $\mathcal{M}$ and $\Theta$ satisfy certain further conditions. A simple sufficient condition is that
$\mathcal{M}$ be an exponential family [Casella and Berger 1990]. Roughly, this is a family
that can be parameterized so that for all $x$, $P(x \mid \beta) = \exp(\beta t(x)) f(x) g(\beta)$,
where $t : \mathcal{X} \to \mathbb{R}$ is a function of $x$. The Bernoulli model is an exponential
family, as can be seen by setting $\beta := \ln \theta - \ln(1 - \theta)$ and $t(x) = x$. Also
the multinomial, Gaussian, Poisson, Gamma, exponential, Zipf and many other
models are exponential families; but, for example, mixture models are not.

More general conditions are given by Takeuchi and Barron [1997, 1998] and Takeuchi [2000].
Essentially, if $\mathcal{M}$ behaves 'asymptotically' like an exponential family, then (2.21) still holds.
For example, (2.21) holds for the Markov models and for AR and ARMA processes.

Example 2.17 [Complexity of the Bernoulli Model] The Bernoulli model $\mathcal{B}^{(0)}$
can be parameterized in a 1-1 way by the unit interval (Example 2.7). Thus, $k = 1$.
An easy calculation shows that the Fisher information is given by $I(\theta) = 1/(\theta(1 - \theta))$.
Plugging this into (2.21) and calculating $\int_0^1 \sqrt{|I(\theta)|} \, d\theta = \int_0^1 \theta^{-1/2} (1 - \theta)^{-1/2} \, d\theta = \pi$ gives

$$\mathbf{COMP}_n(\mathcal{B}^{(0)}) = \frac{1}{2} \log \frac{n}{2\pi} + \log \pi + o(1) = \frac{1}{2} \log n + \frac{1}{2} \log \frac{\pi}{2} + o(1) \approx \frac{1}{2} \log n + 0.3257 + o(1).$$

Computing the integral of the Fisher determinant is not easy in general. Hanson and Fu [2004]
compute it for several practically relevant models.
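
The asymptotic formula (2.21) can be compared with the exact complexity of the Bernoulli model, which is computable by summing over the number of ones. The Python sketch below is a numerical check under stated assumptions: logarithms are base 2, the Fisher information of the Bernoulli model is taken to be the standard $1/(\theta(1-\theta))$ (in natural-log units), whose square root integrates to $\pi$ over the unit interval, and the function names are hypothetical.

```python
import math

def comp_exact(n):
    """Exact COMP_n in bits for the Bernoulli model, grouping sequences by their number of ones."""
    total = 0.0
    for k in range(n + 1):
        # log of binomial(n, k) * (k/n)^k * ((n-k)/n)^(n-k), computed in log space for stability
        log_term = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        if k:
            log_term += k * math.log(k / n)
        if n - k:
            log_term += (n - k) * math.log((n - k) / n)
        total += math.exp(log_term)
    return math.log2(total)

def comp_asymptotic(n):
    """(2.21) with k = 1 and the integral of sqrt(I(theta)) equal to pi, dropping the o(1) term."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, comp_exact(n), comp_asymptotic(n))
```

In this check the two quantities differ by a few hundredths of a bit at $n = 1000$, and the gap shrinks as $n$ grows, in line with the $o(1)$ term.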

Whereas for finite $\mathcal{M}$, $\mathbf{COMP}_n(\mathcal{M})$ remains bounded, for parametric models it
generally grows logarithmically in $n$. Since typically $-\log P(x^n \mid \hat{\theta}(x^n))$ grows linearly in
$n$, it is still the case that for fixed dimensionality $k$ (i.e. for a fixed $\mathcal{M}$ that is
$k$-dimensional) and large $n$, the part of the codelength $-\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M})$ due to the
complexity of $\mathcal{M}$ is very small compared to the part needed to encode data $x^n$ with
$\hat{\theta}(x^n)$. The term $\int_{\Theta} \sqrt{|I(\theta)|} \, d\theta$ may be interpreted as the contribution of the functional
form of $\mathcal{M}$ to the model complexity [Balasubramanian 2004]. It does not grow with
$n$ so that, when selecting between two models, it becomes irrelevant and can be
ignored for very large $n$. But for small $n$, it can be important, as can be seen from
Example 1.4, Fechner's and Stevens' model. Both models have two parameters, yet
the $\int_{\Theta} \sqrt{|I(\theta)|} \, d\theta$-term is much larger for Stevens' than for Fechner's model. In the
experiments of Myung, Balasubramanian, and Pitt [2000], the parameter set was
restricted to $0 < a < \infty$, $0 < b < 3$ for Stevens' model and $0 < a < \infty$, $0 < b < \infty$ for
Fechner's model. The variance of the error $Z$ was set to 1 in both models. With these
values, the difference in $\int_{\Theta} \sqrt{|I(\theta)|} \, d\theta$ is 3.804, which is non-negligible for small
samples. Thus, Stevens' model contains more distinguishable distributions than Fechner's,
and is better able to capture random noise in the data - as Townsend [1975] already
speculated almost 30 years ago. Experiments suggest that for regression models such as
Stevens' and Fechner's, as well as for Markov models and general exponential families,
the approximation (2.21) is reasonably accurate already for small samples. But this is
certainly not true for general models:



The Asymptotic Expansion of $\mathbf{COMP}_n$ Should Be Used with Care!

(2.21) does not hold for all parametric models; and for some models for which it
does hold, the $o(1)$ term may converge to 0 only for quite large sample sizes.
Foster and Stine [1999, 2004] show that the approximation (2.21) is, in general,
only valid if $k$ is much smaller than $n$.



Two-part codes and $\mathbf{COMP}_n(\mathcal{M})$ We now have a clear guiding principle
(minimax regret) which we can use to construct 'optimal' two-part codes, that achieve the
minimax regret among all two-part codes. How do such optimal two-part codes compare
to the NML codelength? Let $\mathcal{M}$ be a $k$-dimensional model. By slightly adjusting the
arguments of [Barron and Cover 1991, Appendix], one can show that, under regularity
conditions, the minimax optimal two-part code $\bar{P}_{2\text{-p}}$ achieves regret

$$-\log \bar{P}_{2\text{-p}}(x^n \mid \mathcal{M}) + \log P(x^n \mid \hat{\theta}(x^n)) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\theta \in \Theta} \sqrt{|I(\theta)|} \, d\theta + f(k) + o(1),$$

where $f : \mathbb{N} \to \mathbb{R}$ is a bounded positive function satisfying $\lim_{k \to \infty} f(k) = 0$. Thus, for
large $k$, optimally designed two-part codes are about as good as NML. The problem
with two-part code MDL is that in practice, people often use much cruder codes with
much larger minimax regret.

2.6.3 Bayesian Interpretation 

The Bayesian method of statistical inference provides several alternative approaches to
model selection. The most popular of these is based on Bayes factors [Kass and Raftery 1995].
The Bayes factor method is very closely related to the refined MDL approach.
Assuming uniform priors on models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, it tells us to select the model with
largest marginal likelihood $\bar{P}_{\text{Bayes}}(x^n \mid \mathcal{M}^{(j)})$, where $\bar{P}_{\text{Bayes}}$ is as in (2.12), with the sum
replaced by an integral, and $w^{(j)}$ is the density of the prior distribution on $\mathcal{M}^{(j)}$:

$$\bar{P}_{\text{Bayes}}(x^n \mid \mathcal{M}^{(j)}) = \int P(x^n \mid \theta) \, w^{(j)}(\theta) \, d\theta. \tag{2.22}$$

$\mathcal{M}$ is Exponential Family Let now $\bar{P}_{\text{Bayes}} = \bar{P}_{\text{Bayes}}(\cdot \mid \mathcal{M})$ for some fixed model $\mathcal{M}$.
Under regularity conditions on $\mathcal{M}$, we can perform a Laplace approximation of the
integral in (2.12). For the special case that $\mathcal{M}$ is an exponential family, we obtain the
following expression for the regret [Jeffreys 1961; Schwarz 1978; Kass and Raftery 1995;
Balasubramanian 1997]:

$$-\log \bar{P}_{\text{Bayes}}(x^n) - [-\log P(x^n \mid \hat{\theta}(x^n))] = \frac{k}{2} \log \frac{n}{2\pi} - \log w(\hat{\theta}) + \log \sqrt{|I(\hat{\theta})|} + o(1). \tag{2.23}$$

Let us compare this with (2.21). Under the regularity conditions needed for (2.21), the
quantity on the right of (2.23) is within $O(1)$ of $\mathbf{COMP}_n(\mathcal{M})$. Thus, the code length
achieved with $\bar{P}_{\text{Bayes}}$ is within a constant of the minimax optimal $-\log \bar{P}_{\text{nml}}(x^n)$. Since
$-\log P(x^n \mid \hat{\theta}(x^n))$ increases linearly in $n$, this means that if we compare two models
$\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, then for large enough $n$, Bayes and refined MDL select the same
model. If we equip the Bayesian universal model with a special prior known as the
Jeffreys-Bernardo prior [Jeffreys 1946; Bernardo and Smith 1994],

$$w_{\text{Jeffreys}}(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_{\theta \in \Theta} \sqrt{|I(\theta)|} \, d\theta}, \tag{2.24}$$

then Bayes and refined NML become even more closely related: plugging (2.24) into
(2.23), we find that the right-hand side of (2.23) now simply coincides with (2.21). A
concrete example of Jeffreys' prior is given in Example 2.19. Jeffreys introduced his
prior as a 'least informative prior', to be used when no useful prior knowledge about
the parameters is available [Jeffreys 1946]. As one may expect from such a prior, it
is invariant under continuous 1-to-1 reparameterizations of the parameter space. The
present analysis shows that, when $\mathcal{M}$ is an exponential family, then it also leads to
asymptotically minimax codelength regret: for large $n$, refined NML model selection
becomes indistinguishable from Bayes factor model selection with Jeffreys' prior.
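
For the Bernoulli model this coincidence can be checked numerically. The Python sketch below rests on assumptions that are not stated in this section: it uses the standard facts that Jeffreys' prior for the Bernoulli model is the Beta(1/2, 1/2) density and that the corresponding marginal likelihood of a sequence with $n_1$ ones and $n_0$ zeros is $B(n_1 + \tfrac12, n_0 + \tfrac12) / B(\tfrac12, \tfrac12)$, with $B$ the Beta function. Logarithms are base 2 and the function names are hypothetical.

```python
import math

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_jeffreys_codelength(n1, n0):
    """-log2 of the Bernoulli marginal likelihood under the Beta(1/2, 1/2) prior."""
    return -(log_beta(n1 + 0.5, n0 + 0.5) - log_beta(0.5, 0.5)) / math.log(2)

def ml_codelength(n1, n0):
    """-log2 P(x^n | theta_hat(x^n)) with theta_hat = n1 / n."""
    n = n1 + n0
    return -sum(c * math.log2(c / n) for c in (n1, n0) if c)

def asymptotic_comp(n):
    """Right-hand side of (2.21) for the Bernoulli model, without the o(1) term."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

if __name__ == "__main__":
    n1, n0 = 300, 700
    regret = bayes_jeffreys_codelength(n1, n0) - ml_codelength(n1, n0)
    print(regret, asymptotic_comp(n1 + n0))
```

For a sequence whose ML estimate lies well inside the unit interval, the regret of the Jeffreys mixture and the asymptotic value of $\mathbf{COMP}_n$ agree to within a small fraction of a bit; close to the boundary of the parameter space the agreement is worse, consistent with the second regularity condition for (2.21).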

$\mathcal{M}$ is not an Exponential Family Under weak conditions on $\mathcal{M}$, $\Theta$ and the
sequence $x^n$, we get the following generalization of (2.23):

$$-\log \bar{P}_{\text{Bayes}}(x^n \mid \mathcal{M}) = -\log P(x^n \mid \hat{\theta}(x^n)) + \frac{k}{2} \log \frac{n}{2\pi} - \log w(\hat{\theta}) + \log \sqrt{|\hat{I}(x^n)|} + o(1). \tag{2.25}$$

Here $\hat{I}(x^n)$ is the so-called observed information, sometimes also called observed Fisher
information; see [Kass and Voss 1997] for a definition. If $\mathcal{M}$ is an exponential family,
then the observed Fisher information at $x^n$ coincides with the Fisher information at
$\hat{\theta}(x^n)$, leading to (2.23). If $\mathcal{M}$ is not exponential, then if data are distributed according
to one of the distributions in $\mathcal{M}$, the observed Fisher information still converges with
probability 1 to the expected Fisher information. If $\mathcal{M}$ is neither exponential, nor
are the data actually generated by a distribution in $\mathcal{M}$, then there may be $O(1)$-discrepancies
between $-\log \bar{P}_{\text{nml}}$ and $-\log \bar{P}_{\text{Bayes}}$ even for large $n$.

2.6.4 Prequential Interpretation 

Distributions as Prediction Strategies Let $P$ be a distribution on $\mathcal{X}^n$. Applying
the definition of conditional probability, we can write for every $x^n$:

$$P(x^n) = \prod_{i=1}^{n} \frac{P(x^i)}{P(x^{i-1})} = \prod_{i=1}^{n} P(x_i \mid x^{i-1}), \tag{2.26}$$

so that also

$$-\log P(x^n) = \sum_{i=1}^{n} -\log P(x_i \mid x^{i-1}). \tag{2.27}$$

Let us abbreviate $P(X_i = \cdot \mid X^{i-1} = x^{i-1})$ to $P(X_i \mid x^{i-1})$. $P(X_i \mid x^{i-1})$ (capital
$X_i$) is the distribution (not a single number) of $X_i$ given $x^{i-1}$; $P(x_i \mid x^{i-1})$ (lower case
$x_i$) is the probability (a single number) of actual outcome $x_i$ given $x^{i-1}$. We can think
of $-\log P(x_i \mid x^{i-1})$ as the loss incurred when predicting $X_i$ based on the conditional
distribution $P(X_i \mid x^{i-1})$, and the actual outcome turned out to be $x_i$. Here 'loss' is
measured using the so-called logarithmic score, also known simply as 'log loss'. Note
that the more likely $x$ is judged to be, the smaller the loss incurred when $x$
actually obtains. The log loss has a natural interpretation in terms of sequential gambling
[Cover and Thomas 1991], but its main interpretation is still in terms of coding: by
(2.27), the codelength needed to encode $x^n$ based on distribution $P$ is just the
accumulated log loss incurred when $P$ is used to sequentially predict the $i$-th outcome based
on the past $i - 1$ outcomes.

(2.26) gives a fundamental re-interpretation of probability distributions as
prediction strategies, mapping each individual sequence of past observations $x_1, \ldots, x_{i-1}$ to a
probabilistic prediction of the next outcome $P(X_i \mid x^{i-1})$. Conversely, (2.26) also shows
that every probabilistic prediction strategy for sequential prediction of $n$ outcomes may
be thought of as a probability distribution on $\mathcal{X}^n$: a strategy is identified with a function
mapping all potential initial segments $x^{i-1}$ to the prediction that is made for the next
outcome $X_i$, after having seen $x^{i-1}$. Thus, it is a function $S : \bigcup_{0 \le i < n} \mathcal{X}^i \to \mathcal{P}_{\mathcal{X}}$, where
$\mathcal{P}_{\mathcal{X}}$ is the set of distributions on $\mathcal{X}$. We can now define, for each $1 \le i \le n$ and all
$x^{i-1} \in \mathcal{X}^{i-1}$, $P(X_i \mid x^{i-1}) := S(x^{i-1})$. We can turn these partial distributions into a
distribution on $\mathcal{X}^n$ by sequentially plugging them into (2.26).
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Log Loss for Universal Models Let M. be some parametric model and let P 
be some universal model/code relative to $\mathcal{M}$. What do the individual predictions $\bar{P}(X_i \mid x^{i-1})$ look like? Readers familiar with Bayesian statistics will realize that for i.i.d. models, the Bayesian predictive distribution $\bar{P}_{\text{Bayes}}(X_i \mid x^{i-1})$ converges to the ML distribution $P(\cdot \mid \hat\theta(x^{i-1}))$; Example 2.19 provides a concrete case. It seems reasonable to assume that something similar holds not just for $\bar{P}_{\text{Bayes}}$ but for universal models in general. This in turn suggests that we may approximate the conditional distributions $\bar{P}(X_i \mid x^{i-1})$ of any 'good' universal model by the maximum likelihood predictions $P(\cdot \mid \hat\theta(x^{i-1}))$. Indeed, we can recursively define the 'maximum likelihood plug-in' distribution $\bar{P}_{\text{plug-in}}$ by setting, for $i = 1$ to $n$,

$$\bar{P}_{\text{plug-in}}(X_i = \cdot \mid x^{i-1}) := P(X = \cdot \mid \hat\theta(x^{i-1})). \tag{2.28}$$

Then

$$-\log \bar{P}_{\text{plug-in}}(x^n) := \sum_{i=1}^{n} -\log P(x_i \mid \hat\theta(x^{i-1})). \tag{2.29}$$

Indeed, it turns out that under regularity conditions on $\mathcal{M}$ and $x^n$,

$$-\log \bar{P}_{\text{plug-in}}(x^n) = -\log P(x^n \mid \hat\theta(x^n)) + \frac{k}{2}\log n + O(1), \tag{2.30}$$

showing that $\bar{P}_{\text{plug-in}}$ acts as a universal model relative to $\mathcal{M}$, its performance being within a constant of the minimax optimal $\bar{P}_{\text{nml}}$. The construction of $\bar{P}_{\text{plug-in}}$ can be easily extended to non-i.i.d. models, and then, under regularity conditions, (2.30) still holds; we omit the details.

We note that all general proofs of (2.30) that we are aware of show that (2.30) holds with probability 1 or in expectation for sequences generated by some distribution in $\mathcal{M}$ [Rissanen 1984; Rissanen 1986; Rissanen 1989]. Note that the expressions (2.21) and (2.25) for the regret of $\bar{P}_{\text{nml}}$ and $\bar{P}_{\text{Bayes}}$ hold for a much wider class of sequences; they also hold with probability 1 for i.i.d. sequences generated by sufficiently regular distributions outside $\mathcal{M}$. Not much is known about the regret obtained by $\bar{P}_{\text{plug-in}}$ for such sequences, except for some special cases such as if $\mathcal{M}$ is the Gaussian model.

In general, there is no need to use the ML estimator $\hat\theta(x^{i-1})$ in the definition (2.28). Instead, we may try some other estimator which asymptotically converges to the ML estimator; it turns out that some estimators considerably outperform the ML estimator in the sense that (2.29) becomes a much better approximation of $-\log \bar{P}_{\text{nml}}$, see Example 2.19. Irrespective of whether we use the ML estimator or something else, we call model selection based on (2.29) the prequential form of MDL in honor of A. P. Dawid's 'prequential analysis', Section 2.9. It is also known as 'predictive MDL'. The validity of (2.30) was discovered independently by Rissanen [1984] and Dawid [1984].

The prequential view gives us a fourth interpretation of refined MDL model selection: given models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, MDL tells us to pick the model that minimizes the accumulated prediction error resulting from sequentially predicting future outcomes given all the past outcomes.

Example 2.18 [GLRT and Prequential Model Selection] How does this differ from the naive version of the generalized likelihood ratio test (GLRT) that we introduced in Example 2.15? In GLRT, we associate with each model the log-likelihood (minus log loss) that can be obtained by the ML estimator. This is the predictor within the model that minimizes log loss with hindsight, after having seen the data. In contrast, prequential model selection associates with each model the log-likelihood (minus log loss) that can be obtained by using a sequence of ML estimators $\hat\theta(x^{i-1})$ to predict data $x_i$. Crucially, the data on which ML estimators are evaluated has not been used in constructing the ML estimators themselves. This makes the prediction scheme 'honest' (different data are used for training and testing) and explains why it automatically protects us against overfitting.
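The contrast can be made concrete with a small sketch. It assumes binary data and compares a one-element 'fair coin' model $\{\theta = 1/2\}$ with the full Bernoulli model; the smoothed estimator $(n_1 + \lambda)/(i + 2\lambda)$ with $\lambda = 1$ is used for the sequential predictions because the raw ML estimator can assign probability 0 to an outcome. The model pair, the data, and the function names are assumptions made for illustration.

```python
import math

def hindsight_bits(x):
    """-log2 P(x^n | ML estimate): loss of the best Bernoulli parameter chosen after seeing x^n."""
    n, n1 = len(x), sum(x)
    bits = 0.0
    for count in (n1, n - n1):
        if count > 0:
            bits -= count * math.log2(count / n)
    return bits

def prequential_bits(x, lam=1.0):
    """Accumulated log loss when x_i is predicted from x_1..x_{i-1} only."""
    bits, n1 = 0.0, 0
    for i, xi in enumerate(x):
        p1 = (n1 + lam) / (i + 2 * lam)
        bits -= math.log2(p1 if xi == 1 else 1.0 - p1)
        n1 += xi
    return bits

x = [0, 1] * 10                    # 20 outcomes that look like fair coin flips
fair_coin_bits = float(len(x))     # {theta = 1/2}: one bit per outcome, either way
bernoulli_hindsight = hindsight_bits(x)
bernoulli_prequential = prequential_bits(x)
```

With hindsight, the larger Bernoulli model can never lose to the fair coin, since its ML estimator is fitted to the very data it is scored on. Scored prequentially, it pays for having to learn $\theta$ from the data, and on this sequence the fair coin model obtains the smaller accumulated loss.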

Example 2.19 [Laplace and Jeffreys] Consider the prequential distribution for the Bernoulli model, Example 2.7, defined as in (2.28). We show that if we take $\hat\theta$ in (2.28) equal to the ML estimator $n_{[1]}/n$, then the resulting $\bar{P}_{\text{plug-in}}$ is not a universal model; but a slight modification of the ML estimator makes $\bar{P}_{\text{plug-in}}$ a very good universal model. Suppose that $n \ge 3$ and $(x_1, x_2, x_3) = (0, 0, 1)$, a not-so-unlikely initial segment according to most $\theta$. Then $\bar{P}_{\text{plug-in}}(X_3 = 1 \mid x_1, x_2) = P(X = 1 \mid \hat\theta(x_1, x_2)) = 0$, so that by (2.29),

$$-\log \bar{P}_{\text{plug-in}}(x^n) \ge -\log \bar{P}_{\text{plug-in}}(x_3 \mid x_1, x_2) = \infty,$$

whence $\bar{P}_{\text{plug-in}}$ is not universal. Now let us consider the modified ML estimator

$$\hat\theta_\lambda(x^n) := \frac{n_{[1]} + \lambda}{n + 2\lambda}. \tag{2.31}$$

If we take $\lambda = 0$, we get the ordinary ML estimator. If we take $\lambda = 1$, then an exercise involving beta-integrals shows that, for all $i, x^i$, $P(X_i \mid \hat\theta_1(x^{i-1})) = \bar{P}_{\text{Bayes}}(X_i \mid x^{i-1})$, where $\bar{P}_{\text{Bayes}}$ is defined relative to the uniform prior $w(\theta) \equiv 1$. Thus $\hat\theta_1(x^{i-1})$ corresponds to the Bayesian predictive distribution for the uniform prior. This prediction rule was advocated by the great probabilist P.S. de Laplace, co-originator of Bayesian statistics. It may be interpreted as ML estimation based on an extended sample, containing some 'virtual' data: an extra 0 and an extra 1.

Even better, a similar calculation shows that if we take $\lambda = 1/2$, the resulting estimator is equal to $\bar{P}_{\text{Bayes}}(X_i \mid x^{i-1})$ defined relative to Jeffreys' prior. Asymptotically, $\bar{P}_{\text{Bayes}}$ with Jeffreys' prior achieves the same codelengths as $\bar{P}_{\text{nml}}$ (Section 2.6.3). It follows that $\bar{P}_{\text{plug-in}}$ with the slightly modified ML estimator is asymptotically indistinguishable from the optimal universal model $\bar{P}_{\text{nml}}$!
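The three choices of $\lambda$ can be compared directly. The sketch below computes the plug-in codelength in bits for a short binary sequence starting with $(0, 0, 1)$. One detail is an assumption rather than part of the example: for $\lambda = 0$ the ML estimate is undefined on an empty sample, so the first prediction is set to $1/2$.

```python
import math

def plug_in_bits(x, lam):
    """-log2 of the plug-in probability of x under theta_lam = (n1 + lam) / (n + 2 lam)."""
    bits, n1 = 0.0, 0
    for i, xi in enumerate(x):
        if lam == 0 and i == 0:
            p1 = 0.5   # assumption: ML estimate is undefined for an empty sample
        else:
            p1 = (n1 + lam) / (i + 2 * lam)
        p = p1 if xi == 1 else 1.0 - p1
        bits += math.inf if p == 0.0 else -math.log2(p)
        n1 += xi
    return bits

def bayes_uniform_bits(x):
    """-log2 of the Bayesian marginal likelihood under the uniform prior: n1! n0! / (n+1)!."""
    n, n1 = len(x), sum(x)
    return -math.log2(math.factorial(n1) * math.factorial(n - n1) / math.factorial(n + 1))

x = [0, 0, 1, 0, 1, 1, 0, 0]
ml_bits = plug_in_bits(x, lam=0.0)        # infinite: the third outcome had probability 0
laplace_bits = plug_in_bits(x, lam=1.0)   # matches the uniform-prior Bayesian marginal
jeffreys_bits = plug_in_bits(x, lam=0.5)  # matches the Jeffreys-prior Bayesian marginal
```

The ML plug-in codelength is infinite on this sequence, whereas both smoothed estimators give finite codelengths that coincide with the corresponding Bayesian marginal likelihoods.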

For more general models $\mathcal{M}$, such simple modifications of the ML estimator usually do not correspond to a Bayesian predictive distribution; for example, if $\mathcal{M}$ is not convex (closed under taking mixtures) then a point estimator (an element of $\mathcal{M}$) typically does not correspond to the Bayesian predictive distribution (a mixture of elements of $\mathcal{M}$). Nevertheless, modifying the ML estimator by adding some virtual data $y_1, \ldots, y_m$ and replacing $P(X_i \mid \hat\theta(x^{i-1}))$ by $P(X_i \mid \hat\theta(x^{i-1}, y^m))$ in the definition (2.28) may still lead to good universal models. This is of great practical importance, since, using (2.29), $-\log \bar{P}_{\text{plug-in}}(x^n)$ is often much easier to compute than $-\log \bar{P}_{\text{Bayes}}(x^n)$.

Summary We introduced the refined MDL Principle for model selection in a re- 
stricted setting. Refined MDL amounts to selecting the model under which the data 
achieve the smallest stochastic complexity, which is the codelength according to the 
minimax optimal universal model. We gave an asymptotic expansion of stochastic and 
parametric complexity, and interpreted these concepts in four different ways. 

2.7 General Refined MDL: Gluing it All Together 

In the previous section we introduced a 'refined' MDL principle based on minimax regret. Unfortunately, this principle can be applied only in very restricted settings. We now show how to extend refined MDL, leading to a general MDL Principle, applicable to a wide variety of model selection problems. In doing so we glue all our previous insights (including 'crude MDL') together, thereby uncovering a single general, underlying principle, formulated in Figure 2.4. Therefore, if one understands the material in this section, then one understands the Minimum Description Length Principle.

First, in Section 2.7.1, we show how to compare infinitely many models. Then, in Section 2.7.2, we show how to proceed for models $\mathcal{M}$ for which the parametric complexity is undefined. Remarkably, a single, general idea resides behind our solution of both problems, and this leads us to formulate, in Section 2.7.3, a single, general refined MDL Principle.

2.7.1 Model Selection with Infinitely Many Models 

Suppose we want to compare more than two models for the same data. If the number to be compared is finite, we can proceed as before and pick the model $\mathcal{M}^{(k)}$ with smallest $-\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$. If the number of models is infinite, we have to be more careful. Say we compare models $\mathcal{M}^{(1)}, \mathcal{M}^{(2)}, \ldots$ for data $x^n$. We may be tempted to pick the model minimizing $-\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$ over all $k \in \{1, 2, \ldots\}$, but in some cases this gives unintended results. To illustrate, consider the extreme case that every $\mathcal{M}^{(k)}$ contains just one distribution. For example, let $\mathcal{M}^{(1)} = \{P_1\}, \mathcal{M}^{(2)} = \{P_2\}, \ldots$ where $\{P_1, P_2, \ldots\}$ is the set of all Markov chains with rational-valued parameters. In that case, $\mathbf{COMP}_n(\mathcal{M}^{(k)}) = 0$ for all $k$, and we would always select the maximum likelihood Markov chain that assigns probability 1 to data $x^n$. Typically this will be a chain of very high order, severely overfitting the data. This cannot be right! A better idea is to pick the model minimizing

$$-\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}^{(k)}) + L(k), \tag{2.32}$$

where $L$ is the codelength function of some code for encoding model indices $k$. We would typically choose the standard prior for the integers, $L(k) = 2\log k + 1$, Example 2.4. By using (2.32) we avoid the overfitting problem mentioned above: if $\mathcal{M}^{(1)} = \{P_1\}, \mathcal{M}^{(2)} = \{P_2\}, \ldots$ where $P_1, P_2, \ldots$ is a list of all the rational-parameter Markov chains, (2.32) would reduce to two-part code MDL (Section 2.4), which is asymptotically consistent. On the other hand, if $\mathcal{M}^{(k)}$ represents the set of $k$-th order Markov chains, the term $L(k)$ is typically negligible compared to $\mathbf{COMP}_n(\mathcal{M}^{(k)})$, the complexity term associated with $\mathcal{M}^{(k)}$ that is hidden in $-\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$: thus, the complexity of $\mathcal{M}^{(k)}$ comes from the fact that for large $k$, $\mathcal{M}^{(k)}$ contains many distinguishable distributions; not from the much smaller term $L(k) \approx 2\log k$.
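As a rough illustration of (2.32), the sketch below implements the codelength $L(k) = 2\log k + 1$ (in bits), checks numerically that it satisfies the Kraft inequality and hence corresponds to a (defective) prior on the integers, and selects a model index. The per-model codelengths in `example_bits` are invented for illustration and do not come from any data set; computing $-\log \bar{P}_{\text{nml}}$ itself is outside the scope of the sketch.

```python
import math

def L_int(k):
    """Codelength in bits of the index k >= 1 under L(k) = 2 log2 k + 1."""
    return 2 * math.log2(k) + 1

# Kraft check: sum_k 2^{-L(k)} = (1/2) * sum_k 1/k^2, which converges to pi^2 / 12 < 1.
kraft_sum = sum(2.0 ** -L_int(k) for k in range(1, 200001))

def select_model(nml_bits):
    """Return the index k minimizing nml_bits[k] + L(k), as in (2.32).

    nml_bits maps k to -log2 P_nml(x^n | M^(k)); the caller supplies these values.
    """
    return min(nml_bits, key=lambda k: nml_bits[k] + L_int(k))

# Hypothetical codelengths, invented purely for illustration.
example_bits = {1: 120.0, 2: 101.5, 3: 100.9, 4: 100.8}
best_k = select_model(example_bits)
```

In this made-up example the raw codelength keeps decreasing with $k$, but the gain beyond $k = 2$ is smaller than the extra bits needed to encode the index, so the penalized criterion settles on $k = 2$.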



To make our previous approach for a finite set of models compatible with (2.32), we can reinterpret it as follows: we assign uniform codelengths (a uniform prior) to the $\mathcal{M}^{(1)}, \ldots, \mathcal{M}^{(M)}$ under consideration, so that for $k = 1, \ldots, M$, $L(k) = \log M$. We then pick the model minimizing (2.32). Since $L(k)$ is constant over $k$, it plays no role in the minimization and can be dropped from the equation, so that our procedure reduces to our original refined MDL model selection method. We shall henceforth assume that we always encode the model index, either implicitly (if the number of models is finite) or explicitly. The general principle behind this is explained in Section 2.7.3.

2.7.2 The Infinity Problem 

For some of the most commonly used models, the parametric complexity $\mathbf{COMP}(\mathcal{M})$ is undefined. A prime example is the Gaussian location model, which we discuss below. As we will see, we can 'repair' the situation using the same general idea as in the previous subsection.

Example 2.20 [Parametric Complexity of the Normal Distributions] Let $\mathcal{M}$ be the family of Normal distributions with fixed variance $\sigma^2$ and varying mean $\mu$, identified by their densities

$$P(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

extended to sequences $x_1, \ldots, x_n$ by taking product densities. As is well-known [Casella and Berger 1990], the ML estimator $\hat\mu(x^n)$ is equal to the sample mean: $\hat\mu(x^n) = n^{-1}\sum_{i=1}^{n} x_i$. An easy calculation shows that

$$\mathbf{COMP}_n(\mathcal{M}) = \log \int_{x^n} P(x^n \mid \hat\mu(x^n))\, dx^n = \infty,$$

where we abbreviated $dx_1 \ldots dx_n$ to $dx^n$. Therefore, we cannot use basic MDL model selection. It also turns out that $I(\mu) = \sigma^{-2}$ so that

$$\int_{\Theta} \sqrt{|I(\theta)|}\, d\theta = \int_{\mu \in \mathbb{R}} \sqrt{|I(\mu)|}\, d\mu = \infty.$$

Thus, the Bayesian universal model approach with Jeffreys' prior cannot be applied either. Does this mean that our MDL model selection and complexity definitions break down even in such a simple case? Luckily, it turns out that they can be repaired, as we now show. Barron, Rissanen, and Yu [1998] and Foster and Stine [2001] show that, for all intervals $[a, b]$,

$$\int_{x^n : \hat\mu(x^n) \in [a,b]} P(x^n \mid \hat\mu(x^n))\, dx^n = \frac{b-a}{\sqrt{2\pi}\,\sigma} \cdot \sqrt{n}. \tag{2.33}$$

Suppose for the moment that it is known that $\hat\mu$ lies in some set $[-K, K]$ for some fixed $K$. Let $\mathcal{M}_K$ be the set of conditional distributions thus obtained: $\mathcal{M}_K = \{P'(\cdot \mid \mu) \mid \mu \in \mathbb{R}\}$, where $P'(x^n \mid \mu)$ is the density of $x^n$ according to the normal distribution with mean $\mu$, conditioned on $|n^{-1}\sum x_i| < K$. By (2.33), the 'conditional' minimax regret distribution $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}_K)$ is well-defined for all $K > 0$. That is, for all $x^n$ with $|\hat\mu(x^n)| < K$,

$$\bar{P}_{\text{nml}}(x^n \mid \mathcal{M}_K) = \frac{P'(x^n \mid \hat\mu(x^n))}{\int_{x^n : |\hat\mu(x^n)| < K} P'(x^n \mid \hat\mu(x^n))\, dx^n},$$

with regret (or 'conditional' complexity),

$$\mathbf{COMP}_n(\mathcal{M}_K) = \log \int_{|\hat\mu(x^n)| < K} P'(x^n \mid \hat\mu(x^n))\, dx^n = \log K + \frac{1}{2}\log\frac{n}{2\pi} - \log\sigma + 1.$$

This suggests to redefine the complexity of the full model $\mathcal{M}$ so that its regret depends on the area in which $\hat\mu$ falls. The most straightforward way of achieving this is to define a meta-universal model for $\mathcal{M}$, combining the NML with a two-part code: we encode data by first encoding some value for $K$. We then encode the actual data $x^n$ using the code $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}_K)$. The resulting code $\bar{P}_{\text{meta}}$ is a universal code for $\mathcal{M}$ with lengths

$$-\log \bar{P}_{\text{meta}}(x^n \mid \mathcal{M}) := \min_{K} \left\{ -\log \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}_K) + L(K) \right\}. \tag{2.34}$$

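The idea is now to base MDL model selection on $\bar{P}_{\text{meta}}(\cdot \mid \mathcal{M})$ as in (2.34) rather than on the (undefined) $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M})$. To make this work, we need to choose $L$ in a clever manner. A good choice is to encode $K' = \log K$ as an integer, using the standard code for the integers. To see why, note that the regret of $\bar{P}_{\text{meta}}$ now becomes:

$$\begin{aligned}
-\log \bar{P}_{\text{meta}}(x^n \mid \mathcal{M}) - \left[-\log P(x^n \mid \hat\mu(x^n))\right]
&= \min_{K : \log K \in \{1, 2, \ldots\}} \left\{ \log K + \frac{1}{2}\log\frac{n}{2\pi} - \log\sigma + 1 + 2\log\lceil \log K \rceil \right\} + 1 \\
&\le \log|\hat\mu(x^n)| + 2\log\log|\hat\mu(x^n)| + \frac{1}{2}\log\frac{n}{2\pi} - \log\sigma + 4 \\
&\le \mathbf{COMP}_n(\mathcal{M}_{|\hat\mu|}) + 2\log \mathbf{COMP}_n(\mathcal{M}_{|\hat\mu|}) + 3.
\end{aligned} \tag{2.35}$$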

If we had known a good bound $K$ on $|\hat\mu|$ a priori, we could have used the NML model $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}_K)$. With 'maximal' a priori knowledge, we would have used the model $\bar{P}_{\text{nml}}(\cdot \mid \mathcal{M}_{|\hat\mu|})$, leading to regret $\mathbf{COMP}_n(\mathcal{M}_{|\hat\mu|})$. The regret achieved by $\bar{P}_{\text{meta}}$ is almost as good as this 'smallest possible regret-with-hindsight' $\mathbf{COMP}_n(\mathcal{M}_{|\hat\mu|})$: the difference is much smaller than, in fact logarithmic in, $\mathbf{COMP}_n(\mathcal{M}_{|\hat\mu|})$ itself, no matter what $x^n$ we observe. This is the underlying reason why we choose to encode $K$ with log-precision: the basic idea in refined MDL was to minimize worst-case regret, or additional code-length compared to the code that achieves the minimal code-length with hindsight. Here, we use this basic idea on a meta-level: we design a code such that the additional regret is minimized, compared to the code that achieves the minimal regret with hindsight.
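The size of this additional regret can be inspected numerically. The sketch below evaluates the closed-form conditional complexity $\log K + \frac{1}{2}\log\frac{n}{2\pi} - \log\sigma + 1$ (in bits) and the regret of the meta-two-part code, under the assumption that the minimum in (2.34) is taken over bounds $K = 2^j$, $j \in \{1, 2, \ldots\}$, with $K \ge |\hat\mu|$, so that the smallest admissible $j$ is optimal. The chosen values of $n$, $\sigma$ and $\hat\mu$ are arbitrary, and $\hat\mu \ne 0$ is assumed.

```python
import math

def comp_conditional(K, n, sigma):
    """COMP_n(M_K) = log K + (1/2) log(n / 2 pi) - log sigma + 1, in bits."""
    return math.log2(K) + 0.5 * math.log2(n / (2 * math.pi)) - math.log2(sigma) + 1

def meta_regret(mu_hat, n, sigma):
    """Regret of the meta two-part code for nonzero mu_hat.

    Uses the smallest K = 2^j (j = 1, 2, ...) with K >= |mu_hat| and adds
    L(j) = 2 log j + 1 bits for encoding j with the standard code for the integers.
    """
    j = max(1, math.ceil(math.log2(abs(mu_hat))))
    return comp_conditional(2.0 ** j, n, sigma) + 2 * math.log2(j) + 1

n, sigma = 50, 1.0
rows = []
for mu_hat in (3.0, 100.0, 2.0 ** 20):
    hindsight = comp_conditional(abs(mu_hat), n, sigma)   # regret with 'maximal' prior knowledge
    rows.append((mu_hat, hindsight, meta_regret(mu_hat, n, sigma) - hindsight))
```

Each entry of `rows` holds $\hat\mu$, the regret-with-hindsight, and the extra bits paid by the meta code. The extra cost grows only like $2\log\log|\hat\mu|$, so it becomes small relative to the hindsight regret as $|\hat\mu|$ grows.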



This meta-two-part coding idea was introduced by Rissanen [1996]. It can be extended to a wide range of models with $\mathbf{COMP}_n(\mathcal{M}) = \infty$; for example, if the $X_i$ represent outcomes of a Poisson or geometric distribution, one can encode a bound on $\mu$ just like in Example 2.20. If $\mathcal{M}$ is the full Gaussian model with both $\mu$ and $\sigma^2$ allowed to vary, one has to encode a bound on $\mu$ and a bound on $\sigma^2$. Essentially the same holds for
linear regression problems, Section [2. 



Renormalized Maximum Likelihood. Meta-two-part coding is just one possible solution to the problem of undefined $\mathbf{COMP}_n(\mathcal{M})$. It is suboptimal, the main reason being the use of 2-part codes. Indeed, these 2-part codes are not complete (Section 2.2): they reserve several codewords for the same data $D = (x_1, \ldots, x_n)$ (one for each integer value of $\log K$); therefore, there must exist more efficient (one-part) codes $\bar{P}'_{\text{meta}}$ such that for all $x^n \in \mathcal{X}^n$, $\bar{P}'_{\text{meta}}(x^n) > \bar{P}_{\text{meta}}(x^n)$; in keeping with the idea that we should minimize description length, such alternative codes are preferable. This realization has led to a search for more efficient and intrinsic solutions to the problem. Foster and Stine [2001] consider the possibility of restricting the parameter values rather than the data, and develop a general framework for comparing universal codes for models with undefined $\mathbf{COMP}(\mathcal{M})$. Rissanen [2001] suggests the following elegant solution. He defines the Renormalized Maximum Likelihood (RNML) distribution $\bar{P}_{\text{rnml}}$. In our Gaussian example, this universal model would be defined as follows. Let $\hat{K}(x^n)$ be the bound on $\hat\mu(x^n)$ that maximizes $\bar{P}_{\text{nml}}(x^n \mid \mathcal{M}_K)$ for the actually given $x^n$. That is, $\hat{K}(x^n) = |\hat\mu(x^n)|$. Then $\bar{P}_{\text{rnml}}$ is defined as, for all $x^n \in \mathcal{X}^n$,

$$\bar{P}_{\text{rnml}}(x^n \mid \mathcal{M}) = \frac{\bar{P}_{\text{nml}}(x^n \mid \mathcal{M}_{\hat{K}(x^n)})}{\int_{x^n \in \mathbb{R}^n} \bar{P}_{\text{nml}}(x^n \mid \mathcal{M}_{\hat{K}(x^n)})\, dx^n}. \tag{2.36}$$

Model selection between a finite set of models now proceeds by selecting the model maximizing the re-normalized likelihood (2.36).

Region Indifference. All the approaches considered thus far slightly prefer some regions of the parameter space over others. In spite of its elegance, even the Rissanen renormalization is slightly 'arbitrary' in this way: had we chosen the origin of the real line differently, the same sequence $x^n$ would have achieved a different codelength $-\log \bar{P}_{\text{rnml}}(x^n \mid \mathcal{M})$. In recent work, Liang and Barron [2004a; 2004b] consider a novel and quite different approach for dealing with infinite $\mathbf{COMP}_n(\mathcal{M})$ that partially addresses this problem. They make use of the fact that, while Jeffreys' prior is improper ($\int \sqrt{|I(\theta)|}\, d\theta$ is infinite), using Bayes' rule we can still compute Jeffreys' posterior based on the first few observations, and this posterior turns out to be a proper probability measure after all. Liang and Barron use universal models of a somewhat different type than $\bar{P}_{\text{nml}}$, so it remains to be investigated whether their approach can be adapted to the form of MDL discussed here.

2.7.3 The General Picture 

Section 2.7.1 illustrates that, in all applications of MDL, we first define a single universal model that allows us to code all sequences with length equal to the given sample size. If the set of models is finite, we use the uniform prior. We do this in order to be as 'honest' as possible, treating all models under consideration on the same footing. But if the set of models becomes infinite, there exists no uniform prior any more. Therefore, we must choose a non-uniform prior/non-fixed length code to encode the model index. In order to treat all models still 'as equally as possible', we should use some code which is 'close' to uniform, in the sense that the codelength increases only very slowly with $k$. We choose the standard prior for the integers (Example 2.4), but we could also have chosen different priors, for example, a prior $P(k)$ which is uniform on $k = 1, \ldots, M$ for some large $M$, and $P(k) \propto k^{-2}$ for $k > M$. Whatever prior we choose, we are forced to encode a slight preference of some models over others; see Section 2.10.1.

Section 2.7.2 applies the same idea, but implemented at a meta-level: we try to associate with $\mathcal{M}^{(k)}$ a code for encoding outcomes in $\mathcal{X}^n$ that achieves uniform (= minimax) regret for every sequence $x^n$. If this is not possible, we still try to assign regret as 'uniformly' as we can, by carving up the parameter space in regions with larger and larger minimax regret, and devising a universal code that achieves regret not much larger than the minimax regret achievable within the smallest region containing the ML estimator. Again, the codes we used encoded a slight preference of some regions of the parameter space over others, but our aim was to keep this preference as small as possible. The general idea is summarized in Figure 2.4, which provides an (informal) definition of MDL, but only in a restricted context. If we go beyond that context, these prescriptions cannot be used literally - but extensions in the same spirit suggest themselves.

GENERAL 'REFINED' MDL PRINCIPLE for Model Selection

Suppose we plan to select between models $\mathcal{M}^{(1)}, \mathcal{M}^{(2)}, \ldots$ for data $D = (x_1, \ldots, x_n)$. MDL tells us to design a universal code $\bar{P}$ for $\mathcal{X}^n$, in which the index $k$ of $\mathcal{M}^{(k)}$ is encoded explicitly. The resulting code has two parts, the two sub-codes being defined such that

1. All models $\mathcal{M}^{(k)}$ are treated on the same footing, as far as possible: we assign a uniform prior to these models, or, if that is not possible, a prior 'close to' uniform.

2. All distributions within each $\mathcal{M}^{(k)}$ are treated on the same footing, as far as possible: we use the minimax regret universal model $\bar{P}_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$. If this model is undefined or too hard to compute, we instead use a different universal model that achieves regret 'close to' the minimax regret for each submodel of $\mathcal{M}^{(k)}$ in the sense of (2.35).

In the end, we encode data $D$ using a hybrid two-part/one-part universal model, explicitly encoding the models we want to select between and implicitly encoding any distributions contained in those models.

Figure 2.4: The Refined MDL Principle.

Here is a first example of such an extension:

Example 2.21 [MDL and Local Maxima in the Likelihood] In practice we often work with models for which the ML estimator cannot be calculated efficiently; or at least, no algorithm for efficient calculation of the ML estimator is known. Examples are finite and Gaussian mixtures and Hidden Markov models. In such cases one typically resorts to methods such as EM or gradient descent, which find a local maximum of the likelihood surface (function) $P(x^n \mid \theta)$, leading to a local maximum likelihood estimator (LML) $\dot\theta(x^n)$. Suppose we need to select between a finite number of such
models. We may be tempted to pick the model $\mathcal{M}$ maximizing the normalized likelihood $\bar{P}_{\text{nml}}(x^n \mid \mathcal{M})$. However, if we then plan to use the local estimator $\dot\theta(x^n)$ for predicting future data, this is not the right thing to do. To see this, note that, if suboptimal estimators $\dot\theta$ are to be used, the ability of model $\mathcal{M}$ to fit arbitrary data patterns may be severely diminished! Rather than using $\bar{P}_{\text{nml}}$, we should redefine it to take into account the fact that $\dot\theta$ is not the global ML estimator:

$$\bar{P}'_{\text{nml}}(x^n) := \frac{P(x^n \mid \dot\theta(x^n))}{\sum_{x^n \in \mathcal{X}^n} P(x^n \mid \dot\theta(x^n))},$$

leading to an adjusted parametric complexity

$$\mathbf{COMP}'_n(\mathcal{M}) := \log \sum_{x^n \in \mathcal{X}^n} P(x^n \mid \dot\theta(x^n)), \tag{2.37}$$

which, for every estimator $\dot\theta$ different from $\hat\theta$, must be strictly smaller than $\mathbf{COMP}_n(\mathcal{M})$.
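The effect in (2.37) can be seen in a toy setting. The sketch below uses the Bernoulli model, where the sum over all $2^n$ sequences can be grouped by the number of ones, and compares the ML estimator with a deliberately coarse estimator that rounds the ML estimate to a five-point grid. The coarse estimator is only a stand-in for a suboptimal estimator such as a local maximum found by EM; it is an assumption of this sketch, not a construction from the text.

```python
import math

def comp_bits(n, estimator):
    """log2 of the sum over x^n of P(x^n | estimator(n1, n)) for the Bernoulli model.

    Sequences are grouped by their number of ones n1, since the likelihood
    depends on x^n only through n1.
    """
    total = 0.0
    for n1 in range(n + 1):
        theta = estimator(n1, n)
        total += math.comb(n, n1) * theta ** n1 * (1.0 - theta) ** (n - n1)
    return math.log2(total)

def ml(n1, n):
    """Global ML estimator for the Bernoulli model."""
    return n1 / n

def coarse(n1, n):
    """Stand-in for a suboptimal estimator: ML estimate rounded to a coarse grid."""
    grid = (0.1, 0.3, 0.5, 0.7, 0.9)
    return min(grid, key=lambda g: abs(g - n1 / n))

n = 20
comp_ml = comp_bits(n, ml)          # ordinary parametric complexity COMP_n(M)
comp_coarse = comp_bits(n, coarse)  # adjusted complexity COMP'_n(M) for the coarse estimator
```

Because the coarse estimator can never assign a sequence higher probability than the ML estimator does, and assigns strictly lower probability to at least one sequence, `comp_coarse` comes out strictly below `comp_ml`.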

Summary. We have shown how to extend refined MDL beyond the restricted settings of Section 2.6. This uncovered the general principle behind refined MDL for model selection, given in Figure 2.4. General as it may be, it only applies to model selection; in the next section we briefly discuss extensions to other applications.

2.8 Beyond Parametric Model Selection 

The general principle as given in Figure 2.4 only applies to model selection. It can be extended in several directions. These range over many different tasks of inductive inference - we mention prediction, transduction (as defined in [Vapnik 1998]), clustering [Kontkanen, Myllymäki, Buntine, Rissanen, and Tirri 2004] and similarity detection [Li, Chen, Li, Ma, and Vitányi 2003]. In these areas there has been less research and a 'definite' MDL approach has not yet been formulated.

MDL has been developed in some detail for some other inductive tasks: non-parametric inference, parameter estimation and regression and classification problems. We give a very brief overview of these - for details we refer to [Barron, Rissanen, and Yu 1998; Hansen and Yu 2001] and, for the classification case, [Grünwald and Langford 2004].



Non-Parametric Inference. Sometimes the model class $\mathcal{M}$ is so large that it cannot be finitely parameterized. For example, let $\mathcal{X} = [0, 1]$ be the unit interval and let $\mathcal{M}$ be the i.i.d. model consisting of all distributions on $\mathcal{X}$ with densities $f$ such that $-\log f(x)$ is a continuous function on $\mathcal{X}$. $\mathcal{M}$ is clearly 'non-parametric': it cannot be meaningfully parameterized by a connected finite-dimensional parameter set $\Theta^{(k)} \subseteq \mathbb{R}^k$. We may still try to learn a distribution from $\mathcal{M}$ in various ways, for example by histogram density estimation [Rissanen, Speed, and Yu 1992] or kernel density estimation [Rissanen 1989]. MDL is quite suitable for such applications, in which we typically select a density $f$ from a class $\mathcal{M}^{(n)} \subset \mathcal{M}$, where $\mathcal{M}^{(n)}$ grows with $n$, and every $P^* \in \mathcal{M}$ can be arbitrarily well approximated by members of $\mathcal{M}^{(n)}, \mathcal{M}^{(n+1)}, \ldots$ in the sense that $\lim_{n \to \infty} \inf_{P \in \mathcal{M}^{(n)}} D(P^* \| P) = 0$ [Barron, Rissanen, and Yu 1998]. Here $D$ is the Kullback-Leibler divergence [Cover and Thomas 1991] between $P^*$ and $P$.

MDL Parameter Estimation: Three Approaches. The 'crude' MDL method (Section 2.4) was a means of doing model selection and parameter estimation at the same time. 'Refined' MDL only dealt with selection of models. If instead, or at the same time, parameter estimates are needed, they may be obtained in three different ways. Historically the first way [Rissanen 1989; Hansen and Yu 2001] was to simply use the refined MDL Principle to pick a parametric model $\mathcal{M}^{(k)}$, and then, within $\mathcal{M}^{(k)}$, pick the ML estimator $\hat\theta^{(k)}$. After all, we associate with $\mathcal{M}^{(k)}$ the distribution $\bar{P}_{\text{nml}}$ with codelengths 'as close as possible' to those achieved by the ML estimator. This suggests that within $\mathcal{M}^{(k)}$, we should prefer the ML estimator. But upon closer inspection, Figure 2.4 suggests to use a two-part code also to select $\theta$ within $\mathcal{M}^{(k)}$; namely, we should discretize the parameter space in such a way that the resulting 2-part code achieves the minimax regret among all two-part codes; we then pick the (quantized) $\theta$ minimizing the two-part code length. Essentially this approach has been worked out in detail by Barron and Cover [1991]. The resulting estimators may be called two-part code MDL estimators. A third possibility is to define predictive MDL estimators such as the Laplace and Jeffreys estimators of Example 2.19; once again, these can be understood as an extension of Figure 2.4 [Barron, Rissanen, and Yu 1998]. These second and third possibilities are more sophisticated than the first. However, if the model $\mathcal{M}$ is finite-dimensional parametric and $n$ is large, then both the two-part and the predictive MDL estimators will become indistinguishable from the maximum likelihood estimators. For this reason, it has sometimes been claimed that MDL parameter estimation is just ML parameter estimation. Since for small samples, the estimates can be quite different, this statement is misleading.

Regression. In regression problems we are interested in learning how the values $y_1, \ldots, y_n$ of a regression variable $Y$ depend on the values $x_1, \ldots, x_n$ of the regressor variable $X$. We assume or hope that there exists some function $h : \mathcal{X} \to \mathcal{Y}$ so that $h(X)$ predicts the value $Y$ reasonably well, and we want to learn such an $h$ from data. To this end, we assume a set of candidate predictors (functions) $\mathcal{H}$. In Example 1.2, we took $\mathcal{H}$ to be the set of all polynomials. In the standard formulation of this problem, we take $h$ to express that

$$Y_i = h(X_i) + Z_i, \tag{2.38}$$

where the $Z_i$ are i.i.d. Gaussian random variables with mean 0 and some variance $\sigma^2$, independent of $X_i$. That is, we assume Gaussian noise: (2.38) implies that the conditional density of $y_1, \ldots, y_n$, given $x_1, \ldots, x_n$, is equal to the product of $n$ Gaussian densities:

$$P(y^n \mid x^n, \sigma, h) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{\sum_{i=1}^{n} (y_i - h(x_i))^2}{2\sigma^2}\right). \tag{2.39}$$

When the Codelength for $x^n$ Can Be Ignored

If all models under consideration represent conditional densities or probability mass functions $P(Y \mid X)$, then the codelength for $X_1, \ldots, X_n$ can be ignored in model and parameter selection. Examples are applications of MDL in classification and regression.

Figure 2.5: Ignoring codelengths.

With this choice, the log-likelihood becomes a linear function of the squared error:

$$-\log P(y^n \mid x^n, \sigma, h) = \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - h(x_i))^2 + \frac{n}{2}\log 2\pi\sigma^2. \tag{2.40}$$
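Identity (2.40) is easy to confirm numerically. The sketch below evaluates the negative log-likelihood once as a sum of Gaussian log densities and once via the squared-error form, using natural logarithms (nats); in bits both sides are scaled by the same constant factor. The data points and the candidate predictor $h(x) = x^2$ are made-up toy values.

```python
import math

def gaussian_nll(ys, xs, h, sigma):
    """-ln P(y^n | x^n, sigma, h), computed as a sum of n Gaussian log densities."""
    total = 0.0
    for x, y in zip(xs, ys):
        density = math.exp(-(y - h(x)) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        total -= math.log(density)
    return total

def squared_error_form(ys, xs, h, sigma):
    """Right-hand side of (2.40): scaled squared error plus (n/2) ln(2 pi sigma^2)."""
    sse = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * sigma ** 2) + 0.5 * len(ys) * math.log(2 * math.pi * sigma ** 2)

# Toy data, invented for illustration only.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 1.2, 2.1, 4.3]
h = lambda x: x ** 2          # hypothetical candidate predictor
lhs = gaussian_nll(ys, xs, h, sigma=0.7)
rhs = squared_error_form(ys, xs, h, sigma=0.7)
```

For fixed $\sigma$, the second term does not depend on $h$, so minimizing the codelength over $h$ amounts to minimizing the sum of squared errors.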

Let us now assume that $\mathcal{H} = \bigcup_{k \ge 1} \mathcal{H}^{(k)}$ where for each $k$, $\mathcal{H}^{(k)}$ is a set of functions $h : \mathcal{X} \to \mathcal{Y}$. For example, $\mathcal{H}^{(k)}$ may be the set of $k$-th degree polynomials.

With each model $\mathcal{H}^{(k)}$ we can associate a set of densities (2.39), one for each $(h, \sigma^2)$ with $h \in \mathcal{H}^{(k)}$ and $\sigma^2 \in \mathbb{R}^+$. Let $\mathcal{M}^{(k)}$ be the resulting set of conditional distributions. Each $P(\cdot \mid h, \sigma^2) \in \mathcal{M}^{(k)}$ is identified by the parameter vector $(\alpha_0, \ldots, \alpha_k, \sigma^2)$ so that $h(x) := \sum_{j=0}^{k} \alpha_j x^j$. By Section 2.7.1, (2.8), MDL tells us to select the model minimizing

$$-\log \bar{P}(y^n \mid \mathcal{M}^{(k)}, x^n) + L(k), \tag{2.41}$$

where we may take $L(k) = 2\log k + 1$, and $\bar{P}(\cdot \mid \mathcal{M}^{(k)}, \cdot)$ is now a conditional universal model with small minimax regret. (2.41) ignores the codelength of $x_1, \ldots, x_n$. Intuitively, this is because we are only interested in learning how $y$ depends on $x$; therefore, we do not care how many bits are needed to encode $x$. Formally, this may be understood as follows: we really are encoding the $x$-values as well, but we do so using a fixed code that does not depend on the hypothesis $h$ under consideration. Thus, we are really trying to find the model $\mathcal{M}^{(k)}$ minimizing

$$-\log \bar{P}(y^n \mid \mathcal{M}^{(k)}, x^n) + L(k) + L'(x^n),$$

where $L'$ represents some code for $\mathcal{X}^n$. Since this codelength does not involve $k$, it can be dropped from the minimization; see Figure 2.5. We will not go into the precise definition of $\bar{P}(y^n \mid \mathcal{M}^{(k)}, x^n)$. Ideally, it should be an NML distribution, but just as in Example 2.20, this NML distribution is not well-defined. We can get reasonable alternative universal models after all using any of the methods described in Section 2.7.2; see [Barron, Rissanen, and Yu 1998] and [Rissanen 2000] for details.



'Non-probabilistic' Regression and Classification. In the approach we just described, we modeled the noise as being normally distributed. Alternatively, attempts have been made to learn functions $h \in \mathcal{H}$ directly from the data, without making any probabilistic assumptions about the noise [Rissanen 1989; Barron 1990; Yamanishi 1998; Grünwald 1998; Grünwald 1999]. The idea is to learn a function $h$ that leads to good predictions of future data from the same source in the spirit of Vapnik's [1998] statistical learning theory. Here prediction quality is measured by some fixed loss function; different loss functions lead to different instantiations of the procedure. Such a version of MDL is meant to be more robust, leading to inference of a 'good' $h \in \mathcal{H}$ irrespective of the details of the noise distribution. This loss-based approach has also been the method of choice in applying MDL to classification problems. Here $Y$ takes on values in a finite set, and the goal is to match each feature $X$ (for example, a bit map of a handwritten digit) with its corresponding label or class (e.g., a digit). While several versions of MDL for classification have been proposed [Quinlan and Rivest 1989; Rissanen 1989; Kearns, Mansour, Ng, and Ron 1997], most of these can be reduced to the same approach based on a 0/1-valued loss function [Grünwald 1998]. In recent work [Grünwald and Langford 2004] we show that this MDL approach to classification without making assumptions about the noise may behave suboptimally: we exhibit situations where no matter how large $n$, MDL keeps overfitting, selecting an overly complex model with suboptimal predictive behavior. Modifications of MDL suggested by Barron [1990] and Yamanishi [1998] do not suffer from this defect, but they do not admit a natural coding interpretation any longer. All in all, current versions of MDL that avoid probabilistic assumptions are still in their infancy, and more research is needed to find out whether they can be modified to perform well in more general and realistic settings.

Summary In the previous sections, we have covered basic refined MDL (Section 2.6),
general refined MDL (Section 2.7), and several extensions of refined MDL (this section).
This concludes our technical description of refined MDL. It only remains to place MDL 
in its proper context: what does it do compared to other methods of inductive inference? 
And how well does it perform, compared to other methods? The next two sections are 
devoted to these questions. 

2.9 Relations to Other Approaches to Inductive Inference 

How does MDL compare to other model selection and statistical inference methods? In 
order to answer this question, we first have to be precise about what we mean by 'MDL'; 
this is done in Section 2.9.1. We then continue in Section 2.9.2 by summarizing MDL's
relation to Bayesian inference, Wallace's MML Principle, Dawid's prequential model
validation, cross-validation and an 'idealized' version of MDL based on Kolmogorov
complexity. The literature has also established connections between MDL and Jaynes'
[2003] Maximum Entropy Principle [Feder 1986; Li and Vitányi 1997; Grünwald 1998;
Grünwald 2000; Grünwald and Dawid 2004] and Vapnik's [1998] structural risk
minimization principle [Grünwald 1998], but there is no space here to discuss these.
Relations between MDL and Akaike's AIC [Burnham and Anderson 2002] are subtle. They
are discussed by, for example, Speed and Yu [1993].



2.9.1 What is 'MDL' ? 

'MDL' is used by different authors with somewhat different meanings. Some authors
use MDL as a broad umbrella term for all types of inductive inference based on data 
compression. This would, for example, include the 'idealized' versions of MDL based on 
Kolmogorov complexity and Wallace's MML Principle, to be discussed below. At the
other extreme, for historical reasons, some authors use the MDL Criterion to describe 
a very specific (and often not very successful) model selection criterion equivalent to 
BIC, discussed further below. 

Here we adopt the meaning of the term that is embraced in the survey [Barron, Rissanen, and Yu 1998],
written by arguably the three most important contributors to the field: we use MDL
for general inference based on universal models. These include, but are not limited to,
approaches in the spirit of Figure 2.4. For example, some authors have based their
inferences on 'expected' rather than 'individual sequence' universal models
[Barron, Rissanen, and Yu 1998; Liang and Barron 2004a]. Moreover, if we go beyond model
selection (Section 2.8), then the ideas of Figure 2.4 have to be modified to some extent.
In fact, one of the main strengths of 'MDL' in this broad sense is that it can be applied
to ever more exotic modeling situations, in which the models do not resemble anything
that is usually encountered in statistical practice. An example is the model of
context-free grammars, already suggested by Solomonoff [1964]. In this tutorial, we call
applications of MDL that strictly fit into the scheme of Figure 2.4 refined MDL for
model/hypothesis selection; when we simply say 'MDL', we mean 'inductive inference based
on universal models'. This form of inductive inference goes hand in hand with Rissanen's
radical MDL philosophy, which views learning as finding useful properties of the data, not
necessarily related to the existence of a 'truth' underlying the data. This view was
outlined in Chapter 1, Section 1.5. Although MDL practitioners and theorists are usually
sympathetic to it, the different interpretations of MDL listed in Section 2.6 make clear
that MDL applications can also be justified without adopting such a radical philosophy.

2.9.2 MDL and Bayesian Inference 

Bayesian statistics [Lee 1997; Bernardo and Smith 1994] is one of the most well-known,
frequently and successfully applied paradigms of statistical inference. It is often claimed
that 'MDL is really just a special case of Bayes'¹⁹. Although there are close similarities,
this is simply not true. To see this quickly, consider the basic quantity in refined MDL:
the NML distribution $P_{\mathrm{nml}}$, Equation (2.18). While $P_{\mathrm{nml}}$ - although defined in a
completely different manner - turns out to be closely related to the Bayesian marginal
likelihood, this is no longer the case for its 'localized' version (2.37). There is no mention
of anything like this code/distribution in any Bayesian textbook! Thus, it must be the
case that Bayes and MDL are somehow different.

MDL as a Maximum Probability Principle For a more detailed analysis, we
need to distinguish between the two central tenets of modern Bayesian statistics: (1)
Probability distributions are used to represent uncertainty, and to serve as a basis for
making predictions, rather than standing for some imagined 'true state of nature'. (2)
All inference and decision-making is done in terms of prior and posterior distributions.
MDL sticks with (1) (although here the 'distributions' are primarily interpreted as
'codelength functions'), but not (2): MDL allows the use of arbitrary universal models
such as NML and prequential universal models; the Bayesian universal model does
not have a special status among these. In this sense, Bayes offers the statistician less
freedom in choice of implementation than MDL. In fact, MDL may be reinterpreted as
a maximum probability principle, where the maximum is relative to some given model,
in the worst case over all sequences (Rissanen [1987, 1989] uses the phrase 'global
maximum likelihood principle'). Thus, whenever the Bayesian universal model is used in
an MDL application, a prior should be used that minimizes worst-case codelength regret,
or equivalently, maximizes worst-case relative probability. There is no comparable
principle for choosing priors in Bayesian statistics, and in this respect, Bayes offers a
lot more freedom than MDL.

Example 2.22 There is a conceptual problem with Bayes' use of prior distributions:
in practice, we very often want to use models which we a priori know to be
wrong - see Example 1.5. If we use Bayes for such models, then we are forced to
put a prior distribution on a set of distributions which we know to be wrong - that
is, we have degree-of-belief 1 in something we know not to be the case. From an
MDL viewpoint, these priors are interpreted as tools to achieve short codelengths
rather than degrees-of-belief and there is nothing strange about the situation; but
from a Bayesian viewpoint, it seems awkward. To be sure, Bayesian inference often
gives good results even if the model $\mathcal{M}$ is known to be wrong; the point is that
(a) if one is a strict Bayesian, one would never apply Bayesian inference to such
misspecified $\mathcal{M}$, and (b) the Bayesian theory offers no clear explanation of why
Bayesian inference might still give good results for such $\mathcal{M}$. MDL provides both
codelength and predictive-sequential interpretations of Bayesian inference, which
help explain why Bayesian inference may do something reasonable even if $\mathcal{M}$ is
misspecified. To be fair, we should add that there exist variations of the Bayesian
philosophy (e.g. De Finetti's [1974]) which avoid the conceptual problem we just
described.



MDL and BIC In the first paper on MDL, Rissanen [1978] used a two-part code and
showed that, asymptotically, and under regularity conditions, the two-part codelength
of $x^n$ based on a $k$-parameter model $\mathcal{M}$ with an optimally discretized parameter space
is given by

$$-\log P(x^n \mid \hat{\theta}(x^n)) + \frac{k}{2} \log n, \tag{2.42}$$

thus ignoring $O(1)$-terms, which, as we have already seen, can be quite important.
In the same year, Schwarz [1978] showed that, for large enough $n$, Bayesian model
selection between two exponential families amounts to selecting the model minimizing
(2.42), ignoring $O(1)$-terms as well. As a result of Schwarz's paper, model selection
based on (2.42) became known as the BIC (Bayesian Information Criterion). Because it
does not take into account the functional form of the model $\mathcal{M}$, it often does not work
very well in practice.
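
As a rough illustration of (2.42), the following Python sketch compares a parameter-free
fair-coin model ($k = 0$) with the one-parameter Bernoulli model ($k = 1$) on a binary
sequence. The function names, the use of base-2 logarithms and the two sample counts are
assumptions made for this example only; like (2.42) itself, the sketch ignores all $O(1)$ terms.

```python
import math

def bernoulli_ml_codelength(ones, n):
    """-log2 P(x^n | theta_hat) for the Bernoulli model, with theta_hat = ones / n."""
    return -sum(c * math.log2(c / n) for c in (ones, n - ones) if c > 0)

def bic_codelength(ml_codelength_bits, k, n):
    """Codelength (2.42) in bits: maximized-likelihood term plus (k/2) log2 n."""
    return ml_codelength_bits + 0.5 * k * math.log2(n)

def select_model(ones, n):
    """Return the name and codelength of the model with the shorter codelength (2.42)."""
    fair = bic_codelength(float(n), k=0, n=n)  # fair coin: exactly one bit per outcome
    bern = bic_codelength(bernoulli_ml_codelength(ones, n), k=1, n=n)
    return ("fair coin", fair) if fair <= bern else ("Bernoulli", bern)

print(select_model(80, 100))  # strongly biased sample
print(select_model(52, 100))  # nearly balanced sample
```

With 80 ones out of 100 outcomes the Bernoulli model is selected despite its penalty; with
52 ones the gain in fit is smaller than the penalty of $0.5 \log_2 100 \approx 3.3$ bits, and
the fair coin is selected.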

It has sometimes been claimed that MDL = BIC; for example, [Burnham and Anderson 2002,
page 286] write "Rissanen's result is equivalent to BIC". This is wrong, even for
the 1989 version of MDL that Burnham and Anderson refer to - as pointed out by
Foster and Stine [2004], the BIC approximation only holds if the number of parameters
$k$ is kept fixed and $n$ goes to infinity. If we select between nested families of models
where the maximum number of parameters $k$ considered is either infinite or grows
with $n$, then model selection based on both $P_{\mathrm{nml}}$ and on $P_{\mathrm{Bayes}}$ tends to select quite
different models than BIC - if $k$ gets closer to $n$, the contribution to $\mathbf{COMP}_n(\mathcal{M})$ of
each additional parameter becomes much smaller than $0.5 \log n$ [Foster and Stine 2004].
However, researchers who claim MDL = BIC have a good excuse: in early work, Rissanen
himself used the phrase 'MDL criterion' to refer to (2.42), and unfortunately,
the phrase has stuck.
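
To give a feeling for the $O(1)$ terms that (2.42) ignores, the following Python sketch
computes the parametric complexity $\mathbf{COMP}_n(\mathcal{M})$ of the Bernoulli model exactly,
as the logarithm of the sum of maximized likelihoods over all sequences of length $n$, and
prints it next to the BIC penalty $0.5 \log n$. The function name and the use of base-2
logarithms are assumptions of this example; it covers only the simplest one-parameter
case and says nothing about the regime where $k$ grows with $n$.

```python
import math

def bernoulli_comp(n):
    """Parametric complexity COMP_n of the Bernoulli model, in bits."""
    def log_term(m):
        # Natural log of C(n, m) * (m/n)^m * ((n-m)/n)^(n-m), using 0 log 0 = 0.
        log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
        log_ml = sum(c * math.log(c / n) for c in (m, n - m) if c > 0)
        return log_binom + log_ml
    # Sequences with the same number of ones share the same maximized likelihood.
    total = sum(math.exp(log_term(m)) for m in range(n + 1))
    return math.log2(total)

for n in (10, 100, 1000):
    print(n, round(bernoulli_comp(n), 3), round(0.5 * math.log2(n), 3))
```

For this model the gap between the two columns does not vanish as $n$ grows: it tends to
$0.5 \log_2(\pi/2) \approx 0.33$ bits, a constant that BIC leaves out.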

MDL and MML MDL shares some ideas with the Minimum Message Length (MML)
Principle, which predates MDL by 10 years. Key references are [Wallace and Boulton 1968;
Wallace and Boulton 1975] and [Wallace and Freeman 1987]; a long list is in [Comley and Dowe 2004].
Just as in MDL, MML chooses the hypothesis minimizing the code-length of the data.
But the codes that are used are quite different from those in MDL. First of all, in MML
one always uses two-part codes, so that MML automatically selects both a model family
and parameter values. Second, while the MDL codes such as $P_{\mathrm{nml}}$ minimize worst-case
relative code-length (regret), the two-part codes used by MML are designed to minimize
expected absolute code-length. Here the expectation is taken over a subjective
prior distribution defined on the collection of models and parameters under consideration.
While this approach contradicts Rissanen's philosophy, in practice it often leads
to similar results.

Indeed, Wallace and his co-workers stress that their approach is fully (subjective) 
Bayesian. Strictly speaking, a Bayesian should report his findings by citing the full 
posterior distribution. But sometimes one is interested in a single model, or hypothesis 
for the data. A good example is the inference of phylogenetic trees in biological
applications: the full posterior would consist of a mixture of several such trees, which might
all be quite different from each other. Such a mixture is almost impossible to interpret
- to get insight into the data we need a single tree. In that case, Bayesians often use
the MAP (Maximum A Posteriori) hypothesis which maximizes the posterior, or the 
posterior mean parameter value. The first approach has some unpleasant properties, 
for example, it is not invariant under reparameterization. The posterior mean approach 
cannot be used if different model families are to be compared with each other. The 
MML method provides a theoretically sound way of proceeding in such cases.

Figure 2.6: Rissanen's MDL, Wallace's MML and Dawid's Prequential Approach.



2.9.3 MDL, Prequential Analysis and Cross-Validation



In a series of papers, A. P. Dawid [1984, 1992, 1997] put forward a methodology for
probability and statistics based on sequential prediction which he called the prequential
approach. When applied to model selection problems, it is closely related to MDL -
Dawid proposes to construct, for each model $\mathcal{M}^{(j)}$ under consideration, a 'probability
forecasting system' (a sequential prediction strategy) where the $(i+1)$-st outcome
is predicted based on either the Bayesian posterior $P_{\mathrm{Bayes}}(\theta \mid x^i)$ or on some
estimator $\hat{\theta}(x^i)$. Then the model is selected for which the associated sequential prediction
strategy minimizes the accumulated prediction error. Related ideas were put forward
by Hjorth [1982] under the name forward validation and by Rissanen [1984]. From
Section 2.6.4 we see that this is just a form of MDL - strictly speaking, every universal
code can be thought of as a prediction strategy, but for the Bayesian and the plug-in
universal models (Sections 2.6.3 and 2.6.4) the interpretation is much more natural than
for others²⁰. Dawid mostly talks about such 'predictive' universal models. On the
other hand, Dawid's framework allows the prediction loss to be measured in
terms of arbitrary loss functions, not just the log loss. In this sense, it is more general
than MDL. Finally, the prequential idea goes beyond statistics: there is also a
'prequential approach' to probability theory developed by Dawid [Dawid and Vovk 1999]
and Shafer and Vovk [2001].

Note that the prequential approach is similar in spirit to cross-validation. In this 
sense MDL is related to cross-validation as well. The main differences are that in MDL 
and the prequential approach, (1) all predictions are done sequentially (the future is 
never used to predict the past), and (2) each outcome is predicted exactly once. 
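
The sequential character of this kind of model selection can be sketched in a few lines of
Python. The example below accumulates the log loss of a plug-in predictor for binary Markov
chains of a given order, where each outcome is predicted exactly once, using only the
outcomes before it. The Laplace ('add-one') estimator, the treatment of the first few
outcomes as separate shorter contexts, and the function name are assumptions made for this
illustration; other estimators or loss functions could be substituted.

```python
import math

def prequential_codelength(bits, order):
    """Accumulated log loss (in bits) of a Laplace plug-in predictor for a
    binary Markov chain of the given order."""
    counts = {}  # context -> [number of 0s seen, number of 1s seen]
    total = 0.0
    for i, x in enumerate(bits):
        # The context is the (up to) `order` preceding outcomes; empty for order 0.
        context = tuple(bits[max(0, i - order):i])
        c = counts.setdefault(context, [0, 0])
        p = (c[x] + 1) / (c[0] + c[1] + 2)  # estimated from the past only
        total += -math.log2(p)
        c[x] += 1                           # update after predicting
    return total

bits = [0, 1] * 50  # a perfectly alternating sequence
for order in (0, 1):
    print(order, round(prequential_codelength(bits, order), 2))
```

On the alternating sequence the first-order chain accumulates far less loss than the
Bernoulli (order 0) predictor and would therefore be selected. For order 0 the accumulated
loss equals $\log_2 [(n+1) \binom{n}{n_1}]$, the codelength of the Bayesian universal
model with uniform prior, which illustrates the link between plug-in prediction and
universal codes.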



2.9.4 Kolmogorov Complexity and Structure Function; Ideal MDL 

Kolmogorov complexity [Li and Vitányi 1997] has played a large but mostly inspirational
role in Rissanen's development of MDL. Over the last fifteen years, several 'idealized'
versions of MDL have been proposed, which are more directly based on Kolmogorov
complexity theory [Barron 1985; Barron and Cover 1991; Li and Vitányi 1997;
Vereshchagin and Vitányi 2002]. These are all based on two-part codes, where hypotheses
are described using a universal programming language such as C or Pascal. For
example, in one proposal [Barron and Cover 1991], given data $D$ one picks the
distribution minimizing

$$K(P) + [-\log P(D)], \tag{2.43}$$

where the minimum is taken over all computable probability distributions, and $K(P)$
is the length of the shortest computer program that, when input $(x, d)$, outputs $P(x)$
to $d$ bits precision. While such a procedure is mathematically well-defined, it cannot
be used in practice. The reason is that in general, the $P$ minimizing (2.43) cannot
be effectively computed. Kolmogorov himself used a variation of (2.43) in which one
adopts, among all $P$ with $K(P) - \log P(D) \approx K(D)$, the $P$ with smallest $K(P)$. Here
$K(D)$ is the Kolmogorov complexity of $D$, that is, the length of the shortest computer
program that prints $D$ and then halts. This approach is known as the Kolmogorov
structure function or minimum sufficient statistic approach [Vitányi 2004]. In this
approach, the idea of separating data and noise (Section 2.6.1) is taken as basic, and
the hypothesis selection procedure is defined in terms of it. The selected hypothesis may
now be viewed as capturing all structure inherent in the data - given the hypothesis,
the data cannot be distinguished from random noise. Therefore, it may be taken
as a basis for lossy data compression - rather than sending the whole sequence, one
only sends the hypothesis representing the 'structure' in the data. The receiver can
then use this hypothesis to generate 'typical' data for it - this data should then 'look
just the same' as the original data $D$. Rissanen views this separation idea as perhaps
the most fundamental aspect of 'learning by compression'. Therefore, in recent work
he has tried to relate MDL (as defined here, based on lossless compression) to the
Kolmogorov structure function, thereby connecting it to lossy compression, and, as he
puts it, 'opening up a new chapter in the MDL theory' [Vereshchagin and Vitányi 2002;
Vitányi 2004; Rissanen and Tabus 2004].



Summary and Outlook We have shown that MDL is closely related to, yet distinct 
from, several other methods for inductive inference. In the next section we discuss how 
well it performs compared to such other methods. 

2.10 Problems for MDL? 

Some authors have criticized MDL either on conceptual grounds (the idea makes no
sense) [Webb 1996; Domingos 1999] or on practical grounds (sometimes it does not
work very well in practice) [Kearns, Mansour, Ng, and Ron 1997; Pednault 2003]. Are
these criticisms justified? Let us consider them in turn.

2.10.1 Conceptual Problems: Occam's Razor 

The most often heard conceptual criticisms are invariably related to Occam's razor.
We have already discussed in Section 1.5 of the previous chapter why we regard these
criticisms as being entirely mistaken. Based on our newly acquired technical knowledge
of MDL, let us discuss these criticisms a little bit further:

1. 'Occam's Razor (and MDL) is arbitrary' (Section 1.5) If we restrict ourselves
to refined MDL for comparing a finite number of models for which the NML distribution
is well-defined, then there is nothing arbitrary about MDL - it is exactly clear what
codes we should use for our inferences. The NML distribution and its close cousins,
the Jeffreys' prior marginal likelihood $P_{\mathrm{Bayes}}$ and the asymptotic expansion (2.21), are
all invariant to continuous 1-to-1 reparameterizations of the model: parameterizing our
model in a different way (choosing a different 'description language') does not change
the inferred description lengths.

If we go beyond models for which the NML distribution is defined, and/or we com- 
pare an infinite set of models at the same time, then some 'subjectivity' is introduced - 
while there are still tough restrictions on the codes that we are allowed to use, all such 
codes prefer some hypotheses in the model over others. If one does not have an a priori 
preference over any of the hypotheses, one may interpret this as some arbitrariness 
being added to the procedure. But this 'arbitrariness' is of an infinitely milder sort 
than the arbitrariness that can be introduced if we allow completely arbitrary codes 
for the encoding of hypotheses as in crude two-part code MDL, Section 2.4.

Things get more subtle if we are interested not in model selection (find the best
order Markov chain for the data) but in infinite-dimensional estimation (find the
best Markov chain parameters for the data, among the set $\mathcal{B}$ of all Markov chains
of each order). In the latter case, if we are to apply MDL, we somehow have to
carve up $\mathcal{B}$ into subsets $\mathcal{M}^{(0)} \subseteq \mathcal{M}^{(1)} \subseteq \ldots \subseteq \mathcal{B}$. Suppose that we have already
chosen $\mathcal{M}^{(1)} = \mathcal{B}^{(1)}$ as the set of 1-st order Markov chains. We normally take
$\mathcal{M}^{(0)} = \mathcal{B}^{(0)}$, the set of 0-th order Markov chains (Bernoulli distributions). But
we could also have defined $\mathcal{M}^{(0)}$ as the set of all 1-st order Markov chains with
$P(X_{i+1} = 1 \mid X_i = 1) = P(X_{i+1} = 0 \mid X_i = 0)$. This defines a one-dimensional
subset of $\mathcal{B}^{(1)}$ that is not equal to $\mathcal{B}^{(0)}$. While there are several good reasons²¹ for
choosing $\mathcal{B}^{(0)}$ rather than $\mathcal{M}^{(0)}$, there may be no indication that $\mathcal{B}^{(0)}$ is somehow
a priori more likely than $\mathcal{M}^{(0)}$. While MDL tells us that we somehow have to
carve up the full set $\mathcal{B}$, it does not give us precise guidelines on how to do this
- different carvings may be equally justified and lead to different inferences for
small samples. In this sense, there is indeed some form of arbitrariness in this
type of MDL application. But this is unavoidable: we stress that this type of
arbitrariness is enforced by all combined model/parameter selection methods -
whether they be of the Structural Risk Minimization type [Vapnik 1998], AIC-type
[Burnham and Anderson 2002], cross-validation or any other type. The only
alternative is treating all hypotheses in the huge class $\mathcal{B}$ on the same footing, which
amounts to maximum likelihood estimation and extreme overfitting.
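
The two one-dimensional carvings can be compared concretely. The Python sketch below
computes the maximized-likelihood codelength of a binary sequence under $\mathcal{B}^{(0)}$
(Bernoulli) and under the alternative $\mathcal{M}^{(0)}$ (first-order chains in which the
probability of repeating the previous outcome is the same in both states). The function
names, the two test sequences, and the convention that the first outcome of the symmetric
chain is coded with probability 1/2 are assumptions introduced for this illustration.

```python
import math

def entropy_bits(count, total):
    """total * H(count / total) in bits: ML codelength of `total` binary outcomes."""
    return -sum(c * math.log2(c / total) for c in (count, total - count) if c > 0)

def bernoulli_codelength(bits):
    """ML codelength under B^(0): outcomes are i.i.d. with P(X = 1) = theta."""
    return entropy_bits(sum(bits), len(bits))

def symmetric_chain_codelength(bits):
    """ML codelength under the alternative M^(0): P(X_{i+1} = X_i) = theta.
    Assumption: the first outcome is coded with probability 1/2 (one bit)."""
    stays = sum(a == b for a, b in zip(bits, bits[1:]))
    return 1.0 + entropy_bits(stays, len(bits) - 1)

alternating = [0, 1] * 50
blocky = ([1] * 9 + [0]) * 10
for name, data in (("alternating", alternating), ("blocky", blocky)):
    print(name,
          round(bernoulli_codelength(data), 1),
          round(symmetric_chain_codelength(data), 1))
```

On the alternating sequence the symmetric chain needs about 1 bit where the Bernoulli
model needs 100; on the second sequence, which consists mostly of ones, the Bernoulli
model gives the shorter codelength. Both submodels have a single parameter, so which one
is used as the first element of the carving affects the inference for such samples.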

2. 'Occam's razor is false' (Section 1.5) We often try to model real-world situations
that can be arbitrarily complex, so why should we favor simple models? We gave an
informal answer in Section 1.5, where we claimed that even if the true data generating
machinery is very complex, it may be a good strategy to prefer simple models for small
sample sizes.

We are now in a position to give one formalization of this informal claim: it is
simply the fact that MDL procedures, with their built-in preference for 'simple' models
with small parametric complexity, are typically statistically consistent, achieving good
rates of convergence (page 16), whereas methods such as maximum likelihood which do
not take model complexity into account are typically inconsistent whenever they are
applied to complex enough models such as the set of polynomials of each degree or the
set of Markov chains of all orders. This has implications for the quality of predictions:
with complex enough models, no matter how many training data we observe, if we
use the maximum likelihood distribution to predict future data from the same source,
the prediction error we make will not converge to the prediction error that could be
obtained if the true distribution were known; if we use an MDL submodel/parameter
estimate (Section 2.8), the prediction error will converge to this optimal achievable
error.
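
The following Python sketch illustrates why unpenalized maximum likelihood cannot be used
to choose among Markov chains of all orders: the maximized likelihood never gets worse
when the order is increased, even when the data are fair coin flips. As a crude stand-in
for an MDL criterion, the sketch adds the penalty of (2.42) with $k = 2^{\text{order}}$
free parameters; this is an assumption made for brevity and not the refined MDL procedure
of Figure 2.4. The seed, the sample size and the function name are likewise arbitrary
choices of this example.

```python
import math
import random
from collections import Counter

def ml_codelength(bits, order, start):
    """-log2 of the maximized conditional likelihood of bits[start:] under a
    binary Markov chain of the given order (requires start >= order)."""
    context_counts, joint_counts = Counter(), Counter()
    for i in range(start, len(bits)):
        context = tuple(bits[i - order:i])
        context_counts[context] += 1
        joint_counts[context, bits[i]] += 1
    return -sum(c * math.log2(c / context_counts[ctx])
                for (ctx, _), c in joint_counts.items())

rng = random.Random(0)  # arbitrary seed, for reproducibility only
bits = [int(rng.random() < 0.5) for _ in range(200)]  # fair coin: true order is 0
max_order = 6
n = len(bits) - max_order  # every order predicts the same n outcomes

ml = [ml_codelength(bits, k, max_order) for k in range(max_order + 1)]
penalized = [ml[k] + 0.5 * (2 ** k) * math.log2(n) for k in range(max_order + 1)]

for k in range(max_order + 1):
    print(k, round(ml[k], 1), round(penalized[k], 1))
print("penalized choice:", min(range(max_order + 1), key=penalized.__getitem__))
```

The first column of codelengths is non-increasing in the order, so maximum likelihood
alone always favors the most complex chain considered, whereas the penalized codelength
rules out the highest order on this sample.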

Of course, consistency is not the only desirable property of a learning method, and 
it may be that in some particular settings, and under some particular performance 
measures, some alternatives to MDL outperform MDL. Indeed this can happen - see 
below. Yet it remains the case that all methods the author knows of that successfully
deal with models of arbitrary complexity have a built-in preference for selecting simpler
models at small sample sizes - methods such as Vapnik's [1998] Structural Risk
Minimization, penalized minimum error estimators [Barron 1990] and the Akaike criterion
[Burnham and Anderson 2002] all trade off complexity with error on the data, the
result invariably being that in this way, good convergence properties can be obtained.
While these approaches measure 'complexity' in a manner different from MDL, and 
attach different relative weights to error on the data and complexity, the fundamental 
idea of finding a trade-off between 'error' and 'complexity' remains. 

2.10.2 Practical Problems with MDL 

We just described some perceived problems about MDL. Unfortunately, there are also 
some real ones: MDL is not a perfect method. While in many cases, the methods 
described here perform very well²², there are also cases where they perform suboptimally
compared to other state-of-the-art methods. Often this is due to one of two reasons:

1. An asymptotic formula like (2.21) was used and the sample size was not large
enough to justify this [Navarro 2004].

2. $P_{\mathrm{nml}}$ was undefined for the models under consideration, and this was solved by
cutting off the parameter ranges at ad hoc values [Lanterman 2004].

In these cases the problem probably lies with the use of invalid approximations rather
than with the MDL idea itself. More research is needed to find out when the asymptotics
and other approximations can be trusted, and what is the 'best' way to deal
with undefined $P_{\mathrm{nml}}$. For the time being, we suggest avoiding (2.21) whenever
possible, and never cutting off the parameter ranges at arbitrary values - instead, if
$\mathbf{COMP}_n(\mathcal{M})$ becomes infinite, then some of the methods described in Section 2.7.2
should be used. Given these restrictions, $P_{\mathrm{nml}}$ and Bayesian inference with Jeffreys'
prior are the preferred methods, since they both achieve the minimax regret. If they
are either ill-defined or computationally prohibitive for the models under consideration,
one can use a prequential method or a sophisticated two-part code such as described
by Barron and Cover [1991].



MDL and Misspecification However, there is a class of problems where MDL is 
problematic in a more fundamental sense. Namely, if none of the distributions un- 
der consideration represents the data generating machinery very well, then both MDL 
and Bayesian inference may sometimes do a bad job in finding the 'best' approximation
within this class of not-so-good hypotheses. This has been observed in practice²³
[Kearns, Mansour, Ng, and Ron 1997; Clarke 2002; Pednault 2003]. Grünwald and Langford [2004]
show that MDL can behave quite unreasonably for some classification problems in which
the true distribution is not in $\mathcal{M}$. This is closely related to the problematic behavior of
MDL for classification tasks as mentioned in Section 2.8. All this is a bit ironic, since
MDL was explicitly designed not to depend on the untenable assumption that some
$P^* \in \mathcal{M}$ generates the data. But empirically we find that while it generally works quite
well if some $P^* \in \mathcal{M}$ generates the data, it may sometimes fail if this is not the case.

2.11 Conclusion 

MDL is a versatile method for inductive inference: it can be interpreted in at least 
four different ways, all of which indicate that it does something reasonable. It is 
typically asymptotically consistent, achieving good rates of convergence. It achieves all 
this without having been designed for consistency, being based on a philosophy which 
makes no metaphysical assumptions about the existence of 'true' distributions. All this 
strongly suggests that it is a good method to use in practice. Practical evidence shows
that in many contexts it is; in other contexts its behavior can be problematic. In the
author's view, the main challenge for the future is to improve MDL for such cases, by
somehow extending and further refining MDL procedures in a non-ad hoc manner. I
am confident that this can be done, and that MDL will continue to play an important 
role in the development of statistical, and more generally, inductive inference. 

Further Reading MDL can be found on the web at www.mdl-research.org. Good
places to start further exploration of MDL are [Barron, Rissanen, and Yu 1998] and
[Hansen and Yu 2001]. Both papers provide excellent introductions, but they are geared
towards a more specialized audience of information theorists and statisticians,
respectively. Also worth reading is Rissanen's [1989] monograph. While outdated as an
introduction to MDL methods, this famous 'little green book' still serves as a great
introduction to Rissanen's radical but appealing philosophy, which is described very
eloquently.

Notes

1. But see Section EHHl 

2. Working directly with distributions on infinite sequences is more elegant, but it requires measure 
theory, which we want to avoid here. 

3. Also known as instantaneous codes and called, perhaps more justifiably, 'prefix-free' codes in
[Li and Vitányi 1997].

4. For example, with non-integer codelengths the notion of 'code' becomes invariant to the size of the 
alphabet in which we describe data. 

5. As understood in elementary probability, i.e. with respect to Lebesgue measure. 

6. Even if one adopts a Bayesian stance and postulates that an agent can come up with a (subjective) 
distribution for every conceivable domain, this problem remains: in practice, the adopted distribution 
may be so complicated that we cannot design the optimal code corresponding to it, and have to use 
some ad hoc code instead.

7. Henceforth, we simply use 'model' to denote probabilistic models; we typically use $\mathcal{H}$ to denote
sets of hypotheses such as polynomials, and reserve $\mathcal{M}$ for probabilistic models.

8. The terminology 'crude MDL' is not standard. It is introduced here for pedagogical reasons, 
to make clear the importance of having a single, unified principle for designing codes. It should 
be noted that Rissanen's and Barron's early theoretical papers on MDL already contain such
principles, albeit in a slightly different form than in their recent papers. Early practical applications
[Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are 'crude'
in the sense defined here.

9. See the previous endnote. 

10. But see [Grünwald 1998], Chapter 5, for more discussion.

11. See Section 1.5 of Chapter 1 for a discussion on the role of consistency in MDL.

12. See, for example, [Barron and Cover 1991] and [Barron 1985].

13. Strictly speaking, the assumption that n is given in advance (i.e., both encoder and decoder know 
n) contradicts the earlier requirement that the code to be used for encoding hypotheses is not allowed 
to depend on n. Thus, strictly speaking, we should first encode some n explicitly, using 2 log n + 1 bits 
(Example H|1J, and then pick the n (typically, but not necessarily equal to the actual sample size) that 
allows for the shortest three-part codelength of the data (first encode $n$, then $(k, \theta)$, then the data).
In practice this will not significantly alter the chosen hypothesis, unless for some quite special data 
sequences. 

14. As explained in Figure 2.2, we identify these codes with their length functions, which is the only
aspect we are interested in. 

15. The reason is that, in the full Bernoulli model with parameter $\theta \in [0, 1]$, the maximum likelihood
estimator is given by $n_1/n$, see Example 2.7. Since the likelihood $\log P(x^n \mid \theta)$ is a continuous function
of $\theta$, this implies that if the frequency $n_1/n$ in $x^n$ is approximately (but not precisely) $j/10$, then the
ML estimator in the restricted model $\{0.1, \ldots, 0.9\}$ is still given by $\hat{\theta} = j/10$. Then $\log P(x^n \mid \theta)$ is
maximized by $\theta = j/10$, so that the $L \in \mathcal{L}$ that minimizes codelength corresponds to $\theta = j/10$.

16. What we call 'universal model' in this text is known in the literature as a 'universal model in the
individual sequence sense' - there also exist universal models in an 'expected sense', see Section 2.9.1.
These lead to slightly different versions of MDL.

17. To be fair, we should add that this naive version of GLRT is introduced here for educational
purposes only. It is not recommended by any serious statistician!

18. The standard definition of Fisher information [Kass and Voss 1997] is in terms of first derivatives
of the log-likelihood; for most parametric models of interest, the present definition coincides with the 
standard one. 

19. The author has heard many people say this at many conferences. The reasons are probably his- 
torical: while the underlying philosophy has always been different, until Rissanen introduced the use 
of $P_{\mathrm{nml}}$, most actual implementations of MDL 'looked' quite Bayesian.

20. The reason is that the Bayesian and plug-in models can be interpreted as probabilistic sources.
The NML and the two-part code models are not probabilistic sources, since $P^{(n)}$ and $P^{(n+1)}$ are not
compatible in the sense of Section 2.2.

21. For example, $\mathcal{B}^{(0)}$ is easier to interpret.

22. We mention [Hansen and Yu 2000; Hansen and Yu 2001] reporting excellent behavior of MDL in
regression contexts; and [Allen, Madani, and Greiner 2003; Kontkanen, Myllymäki, Silander, and Tirri 1999;
Modha and Masry 1998] reporting excellent behavior of predictive (prequential) coding in Bayesian
network model selection and regression. Also, 'objective Bayesian' model selection methods are frequently
and successfully used in practice [Kass and Wasserman 1996]. Since these are based on non-informative
priors such as Jeffreys', they often coincide with a version of refined MDL and thus indicate successful
performance of MDL.

23. But see Viswanathan, Wallace, Dowe, and Korb [1999], who point out that the problem of [Kearns, Mansour, Ng, and Ron 1997]
disappears if a more reasonable coding scheme is used.